Abstract

Random matrices are widely used in sparse recovery problems, and the relevant properties of matrices
with i.i.d. entries are well understood. The current paper discusses the recently introduced Restricted 
Eigenvalue (RE) condition, which is among the most general assumptions on the matrix, guaranteeing 
recovery. We prove a reduction principle showing that the RE condition can be guaranteed by checking 
the restricted isometry on a certain family of low-dimensional subspaces. This principle allows us to 
establish the RE condition for several broad classes of random matrices with dependent entries, including 
random matrices with subgaussian rows and non-trivial covariance structure, as well as matrices with 
independent rows, and uniformly bounded entries. 



1 Introduction 

In a typical high dimensional setting, the number of variables $p$ is much larger than the number of
observations $n$. This challenging setting appears in statistics and signal processing, for example, in
regression, covariance selection on Gaussian graphical models, signal reconstruction, and sparse
approximation. Consider a simple setting, where we try to recover a vector $\beta \in \mathbb{R}^p$ in the
following linear model:

$$Y = X\beta + \epsilon. \tag{1.1}$$

Here $X$ is an $n \times p$ design matrix, $Y$ is a vector of noisy observations, and $\epsilon$ is the noise
term. Even in the noiseless case, recovering $\beta$ (or its support) from $(X, Y)$ seems impossible when
$n \ll p$, given that we have more variables than observations.

A line of recent research shows that when $\beta$ is sparse, that is, when it has a relatively small number
of nonzero coefficients, it is possible to recover $\beta$ from an underdetermined system of equations. In
order to ensure reconstruction, the design matrix $X$ needs to behave sufficiently nicely in the sense that it
satisfies certain incoherence conditions. One notion of incoherence which has been formulated in the sparse
reconstruction literature (Candes and Tao, 2005, 2006, 2007) bears the name of Uniform Uncertainty
Principle (UUP). It states that for all $s$-sparse sets $T$, the matrix $X$ restricted to the columns from $T$
acts as an almost isometry. Let $X_T$, where $T \subset \{1, \ldots, p\}$, be the $n \times |T|$ submatrix obtained
by extracting columns of $X$ indexed by $T$. For each integer $s = 1, 2, \ldots$ such that $s < p$, the
$s$-restricted isometry constant $\theta_s$ of $X$ is the smallest quantity such that

$$(1 - \theta_s)\,\|c\|_2^2 \le \|X_T c\|_2^2 / n \le (1 + \theta_s)\,\|c\|_2^2, \tag{1.2}$$

for all $T \subset \{1, \ldots, p\}$ with $|T| \le s$ and coefficient sequences $(c_j)_{j \in T}$. Throughout this
paper, we refer to a vector $\beta \in \mathbb{R}^p$ with at most $s$ non-zero entries, where $s \le p$, as an
$s$-sparse vector.

To understand the formulation of the UUP, consider the simplest noiseless case as mentioned earlier, where
we assume $\epsilon = 0$ in (1.1). Given a set of values $(\langle X^i, \beta \rangle)_{i=1}^n$, where
$X^1, X^2, \ldots$ are independent random vectors in $\mathbb{R}^p$, the basis pursuit program (Chen et al.,
1998) finds $\hat\beta$ which minimizes the $\ell_1$-norm of $\beta'$ among all $\beta'$ satisfying
$X\beta' = X\beta$, where $X$ is an $n \times p$ matrix with rows $X^1, X^2, \ldots, X^n$. This can be cast as a
linear program and thus is computationally efficient. Under variants of such conditions, the exact recovery
or approximate reconstruction of a sparse $\beta$ using the basis pursuit program has been shown in a series
of powerful results (Donoho, 2006a, 2004; Candes et al., 2006; Candes and Tao, 2005, 2006; Donoho, 2006b;
Rudelson and Vershynin, 2006, 2008; Candes and Tao, 2007). We refer to these papers for further references
on earlier results for sparse recovery.

In other words, under the UUP, the design matrix $X$ is taken as an $n \times p$ measurement ensemble through
which one aims to recover both the unknown non-zero positions and the strength of an $s$-sparse signal
$\beta$ in $\mathbb{R}^p$ efficiently (thus the name for compressed sensing). Naturally, we wish $n$ to be as
small as possible for given values of $p$ and $s$. It is well known that for random matrices, UUP holds for
$s = O(n/\log(p/n))$ with i.i.d. Gaussian random entries, Bernoulli, and in general subgaussian entries
(Candes and Tao, 2005; Rudelson and Vershynin, 2005; Candes and Tao, 2006; Donoho, 2006b; Baraniuk et al.,
2008; Mendelson et al., 2008). Recently, it has been shown (Adamczak et al., 2009) that UUP holds for
$s = O(n/\log^2(p/n))$ when $X$ is a random matrix composed of columns that are independent isotropic vectors
with log-concave densities. For a random Fourier ensemble, or randomly sampled rows of orthonormal matrices,
it is shown (Rudelson and Vershynin, 2006, 2008) that the UUP holds for $s = O(n/\log^c p)$ for $c = 4$, which
improves upon the earlier result of Candes and Tao (2006) where $c = 6$. To be able to prove UUP for random
measurements or design matrix, the isotropicity condition (cf. Definition 1.5) has been assumed in all
literature cited above. This assumption is not always reasonable in statistics and machine learning, where
we often come across high dimensional data with correlated entries.

The work of Bickel et al. (2009) formulated the restricted eigenvalue (RE) condition and showed that it is
among the weakest and hence the most general conditions in literature imposed on the Gram matrix in order
to guarantee nice statistical properties for the Lasso estimator (Tibshirani, 1996) as well as the Dantzig
selector (Candes and Tao, 2007). In particular, it is shown to be a relaxation of the UUP under suitable
choices of parameters involved in each condition; see Bickel et al. (2009). We now state one version of the
Restricted Eigenvalue condition as formulated in Bickel et al. (2009). For some integer $0 < s_0 < p$ and a
positive number $k_0$, $\mathrm{RE}(s_0, k_0, X)$ for matrix $X$ requires that the following holds:

$$\forall \upsilon \ne 0, \quad \min_{\substack{J \subset \{1,\ldots,p\}, \\ |J| \le s_0}} \ \min_{\|\upsilon_{J^c}\|_1 \le k_0 \|\upsilon_J\|_1} \frac{\|X\upsilon\|_2}{\|\upsilon_J\|_2} > 0, \tag{1.3}$$

where $\upsilon_J$ represents the subvector of $\upsilon \in \mathbb{R}^p$ confined to a subset $J$ of
$\{1, \ldots, p\}$. In the context of compressed sensing, the RE condition can also be taken as a way to
guarantee recovery for anisotropic measurements. We refer to van de Geer and Buhlmann (2009) for other
conditions which are closely related to the RE condition.

Consider now the linear regression model in (1.1). For a chosen penalization parameter $\lambda_n \ge 0$,
regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso (Tibshirani, 1996), refers to
the following convex optimization problem

$$\hat\beta = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \|\beta\|_1, \tag{1.4}$$

where the scaling factor $1/(2n)$ is chosen for convenience. Under i.i.d. Gaussian noise and the RE
condition, bounds on $\ell_2$ prediction loss and on $\ell_q$, $1 \le q \le 2$, loss for estimating the
parameter $\beta$ in (1.1) for both the Lasso and the Dantzig selector have all been derived in Bickel et al.
(2009). In particular, an $\ell_2$ loss of $\Theta(\lambda\sigma\sqrt{s})$ was obtained for the Lasso under
$\mathrm{RE}(s, 3, X)$ and for the Dantzig selector under $\mathrm{RE}(s, 1, X)$ respectively in Bickel et al.
(2009), where it is shown that the $\mathrm{RE}(s, 1, X)$ condition is weaker than the UUP used in Candes and
Tao (2007).
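
To make the Lasso program concrete, here is a minimal pure-Python sketch of the penalized least squares objective with the $1/(2n)$ scaling, minimized by cyclic coordinate descent. Coordinate descent is chosen here only for illustration; it is not the method analyzed in this paper, and the names `soft_threshold`, `lasso_objective`, and `lasso_cd` are hypothetical.

```python
def soft_threshold(z, lam):
    """Return sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_objective(X, Y, beta, lam):
    """(1/(2n)) * ||Y - X beta||_2^2 + lam * ||beta||_1, with X a list of rows."""
    n = len(X)
    resid = [Y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))
             for i in range(n)]
    return sum(r * r for r in resid) / (2.0 * n) + lam * sum(abs(b) for b in beta)


def lasso_cd(X, Y, lam, sweeps=100):
    """Cyclic coordinate descent (an illustrative solver, not from the paper)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual that leaves coordinate j out of the fit.
            r = [Y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Toy design: n = p = 2 with orthogonal columns, so each coordinate decouples.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [2.0, 0.1]
beta_hat = lasso_cd(X, Y, lam=0.1)
```

In this toy case the small second coordinate is set exactly to zero, while the large first coordinate is shrunk; this is the thresholding behavior that the $\ell_1$ penalty is meant to produce.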

The RE condition with parameters $s_0$ and $k_0$ for random measurements / design matrix has been proved for
a random Gaussian vector (Raskutti et al., 2009, 2010) with a sample bound of order $n = O(s_0 \log p)$, when
condition (1.3) holds for the square root of the population covariance matrix $\Sigma$. As we show below, the
bound $n = O(s_0 \log p)$ can be improved to the optimal one $n = O(s_0 \log(p/s_0))$ when
$\mathrm{RE}(s_0, k_0, \Sigma^{1/2})$ is replaced with $\mathrm{RE}(s_0, (1+\varepsilon)k_0, \Sigma^{1/2})$ for any
$\varepsilon > 0$. The papers Raskutti et al. (2009, 2010) have motivated the investigation for a non-i.i.d.
subgaussian random design by Zhou (2009a), as well as the present work. The proof of Raskutti et al. (2010)
relies on a deep result from the theory of Gaussian random processes, Gordon's Minimax Lemma (Gordon, 1985).
However, this result relies on the properties of the normal random variables, and is not available beyond
the Gaussian setting. To establish the RE condition for more general classes of random matrices we had to
introduce a new approach based on geometric functional analysis. We defer the comparison of the present
paper with Zhou (2009a) to Section 1.2. Both Zhou et al. (2009b) and van de Geer and Buhlmann (2009) obtained
weaker results which are based on bounding the maximum entry-wise difference between the sample and the
population covariance matrices. We refer to Raskutti et al. (2010) for a more elaborate comparison.

1.1 Notation and definitions 

Let $e_1, \ldots, e_p$ be the canonical basis of $\mathbb{R}^p$. For a set $J \subset \{1, \ldots, p\}$, denote
$E_J = \mathrm{span}\{e_j : j \in J\}$. For a matrix $A$, we use $\|A\|_2$ to denote its operator norm. For a
set $V \subset \mathbb{R}^p$, we let $\mathrm{conv}\, V$ denote the convex hull of $V$. For a finite set $Y$, the
cardinality is denoted by $|Y|$. Let $B_2^p$ and $S^{p-1}$ be the unit Euclidean ball and the unit sphere
respectively. For a vector $u \in \mathbb{R}^p$, let $u_{T_0}$ be the subvector of $u$ confined to the locations
of its $s_0$ largest coefficients in absolute values. In this paper, $C$, $c$, etc., denote various absolute
constants which may change line by line. Occasionally, we use $u_T \in \mathbb{R}^{|T|}$, where
$T \subseteq \{1, \ldots, p\}$, to also represent its 0-extended version $u' \in \mathbb{R}^p$ such that
$u'_{T^c} = 0$ and $u'_T = u_T$.

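We define $\mathrm{Cone}(s_0, k_0)$, where $0 < s_0 < p$ and $k_0$ is a positive number, as the set of vectors
in $\mathbb{R}^p$ which satisfy the following cone constraint:

$$\mathrm{Cone}(s_0, k_0) = \left\{ x \in \mathbb{R}^p \mid \exists I \subset \{1, \ldots, p\},\ |I| = s_0 \ \text{s.t.}\ \|x_{I^c}\|_1 \le k_0 \|x_I\|_1 \right\}. \tag{1.5}$$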
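
The cone constraint is easy to test numerically. A minimal sketch follows, assuming the vector is given as a plain Python list; the function name `in_cone` is hypothetical. It uses the observation that among all index sets $I$ of size $s_0$, the set of the $s_0$ largest coordinates in absolute value maximizes $\|x_I\|_1$ and minimizes $\|x_{I^c}\|_1$, so it suffices to test that one set.

```python
def in_cone(x, s0, k0):
    """Check the cone constraint: some I with |I| = s0 satisfies
    ||x_{I^c}||_1 <= k0 * ||x_I||_1.  Testing the top-s0 set is enough."""
    mags = sorted((abs(v) for v in x), reverse=True)
    head = sum(mags[:s0])   # l1 mass on the s0 largest coordinates
    tail = sum(mags[s0:])   # l1 mass on the remaining coordinates
    return tail <= k0 * head


spiky = [3.0, -2.0, 0.5, 0.5, -1.0]   # mass concentrated on two coordinates
flat = [1.0, 1.0, 1.0, 1.0]           # mass spread evenly
```

Here `in_cone(spiky, 2, 1.0)` holds, while `in_cone(flat, 1, 1.0)` fails and `in_cone(flat, 1, 3.0)` holds: a larger $k_0$ enlarges the cone.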

Let $\beta$ be an $s$-sparse vector and $\hat\beta$ be the solution from either the Lasso or the Dantzig
selector. One of the common properties of the Lasso and the Dantzig selector is: for an appropriately chosen
$\lambda_n$ and under i.i.d. Gaussian noise, the condition

$$\upsilon := \hat\beta - \beta \in \mathrm{Cone}(s, k_0) \tag{1.6}$$

holds with high probability. Here $k_0 = 1$ for the Dantzig selector, and $k_0 = 3$ for the Lasso; see Bickel
et al. (2009) and Candes and Tao (2007) for example. The combination of the cone property (1.6) and the RE
condition leads to various nice convergence results as stated earlier.

We now define some parameters related to the RE and sparse eigenvalue conditions that are relevant.

Definition 1.1. Let $1 \le s_0 \le p$, and let $k_0$ be a positive number. We say that a $q \times p$ matrix $A$
satisfies the $\mathrm{RE}(s_0, k_0, A)$ condition with parameter $K(s_0, k_0, A)$ if for any $\upsilon \ne 0$,

$$\frac{1}{K(s_0, k_0, A)} := \min_{\substack{J \subseteq \{1,\ldots,p\}, \\ |J| \le s_0}} \ \min_{\|\upsilon_{J^c}\|_1 \le k_0 \|\upsilon_J\|_1} \frac{\|A\upsilon\|_2}{\|\upsilon_J\|_2} > 0. \tag{1.7}$$

It is clear that when $s_0$ and $k_0$ become smaller, this condition is easier to satisfy.

Definition 1.2. For $m \le p$, we define the largest and smallest $m$-sparse eigenvalue of a $q \times p$
matrix $A$ to be

$$\rho_{\max}(m, A) := \max_{t \ne 0;\ m\text{-sparse}} \|At\|_2^2 / \|t\|_2^2, \tag{1.8}$$

$$\rho_{\min}(m, A) := \min_{t \ne 0;\ m\text{-sparse}} \|At\|_2^2 / \|t\|_2^2. \tag{1.9}$$



1.2 Main results 

The main purpose of this paper is to show that the RE condition holds with high probability for systems
of random measurements/random design matrices of a general nature. To establish such a result with high
probability, one has to assume that it holds on average. So, our problem boils down to showing that, under
some assumptions on random variables, the RE condition on the covariance matrix implies a similar
condition on a random design matrix with high probability when $n$ is sufficiently large (cf. Theorems 1.6
and 1.8). This generalizes the results on UUP mentioned above, where the covariance matrix is assumed to be
identity.

Denote by $A$ a fixed $q \times p$ matrix. We consider the design matrix $X$ which can be represented as

$$X = \Psi A, \tag{1.10}$$

where the rows of the matrix $\Psi$ are isotropic random vectors. An example of such a random matrix $X$
consists of independent rows, each being a random vector in $\mathbb{R}^p$ that follows a multivariate normal
distribution $N(0, \Sigma)$, when we take $A = \Sigma^{1/2}$ in (1.10). Our first main result is related to this
setup. We consider a matrix represented as $X = \tilde\Psi A$, where the matrix $A$ satisfies the RE condition.
The result is purely geometric, so we consider a deterministic matrix $\tilde\Psi$.

We prove a general reduction principle showing that if the matrix $\tilde\Psi$ acts as almost isometry on the
images of the sparse vectors under $A$, then the product $\tilde\Psi A$ satisfies the RE condition with a
smaller parameter $k_0$. More precisely, we prove Theorem 1.3.

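Theorem 1.3. Let $1/5 > \delta > 0$. Let $0 < s_0 < p$ and $k_0 > 0$. Let $A$ be a $q \times p$ matrix such that
$\mathrm{RE}(s_0, 3k_0, A)$ holds for $0 < K(s_0, 3k_0, A) < \infty$. Set

$$d = s_0 + s_0 \max_j \|Ae_j\|_2^2 \, \frac{16 K^2(s_0, 3k_0, A)(3k_0)^2(3k_0 + 1)}{\delta^2}, \tag{1.11}$$

and let $E = \bigcup_{|J| = d} E_J$ for $d < p$ and $E$ denotes $\mathbb{R}^p$ otherwise. Let $\tilde\Psi$ be a
matrix such that

$$\forall x \in AE, \quad (1 - \delta)\|x\|_2 \le \|\tilde\Psi x\|_2 \le (1 + \delta)\|x\|_2. \tag{1.12}$$

Then the $\mathrm{RE}(s_0, k_0, \tilde\Psi A)$ condition holds for matrix $\tilde\Psi A$ with
$0 < K(s_0, k_0, \tilde\Psi A) \le K(s_0, k_0, A)/(1 - 5\delta)$.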
Remark 1.4. We note that this result does not involve $\rho_{\max}(s_0, A)$, nor the global parameters of the
matrices $A$ and $\tilde\Psi$, such as the norm or the smallest singular value. We refer to Raskutti et al.
(2010) for examples of matrix $A$, where $\rho_{\max}(s_0, A)$ grows with $s_0$ while the RE condition still
holds for $A$.

The assumption $\mathrm{RE}(s_0, 3k_0, A)$ can be replaced by $\mathrm{RE}(s_0, (1 + \varepsilon)k_0, A)$ for any
$\varepsilon > 0$ by appropriately increasing $d$. See Remark 2.6 for details.
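
To get a feeling for the size of the sparsity level $d$ in (1.11), the following sketch simply evaluates the formula. It assumes the reading $d = s_0 + s_0 \max_j \|Ae_j\|_2^2 \cdot 16 K^2(s_0, 3k_0, A)(3k_0)^2(3k_0+1)/\delta^2$, with the $\delta^2$ in the denominator as in the analogous definition (2.9); the function name and argument names are hypothetical, and no rounding to an integer is attempted.

```python
def sparsity_level_d(s0, k0, K, max_col_norm_sq, delta):
    """Evaluate d from (1.11) under the reading stated above.

    K               -- stands for K(s0, 3*k0, A)
    max_col_norm_sq -- stands for max_j ||A e_j||_2^2
    """
    if not 0 < delta < 0.2:
        raise ValueError("Theorem 1.3 assumes 0 < delta < 1/5")
    factor = 16.0 * K ** 2 * (3 * k0) ** 2 * (3 * k0 + 1) / delta ** 2
    return s0 + s0 * max_col_norm_sq * factor


# Unit-norm columns, K = 1, k0 = 1, delta = 1/8.
d_example = sparsity_level_d(s0=5, k0=1, K=1.0, max_col_norm_sq=1.0, delta=0.125)
```

The value is proportional to $s_0$ and does not depend on $p$, which is what makes the reduction to $d$-sparse vectors useful when $p$ is large.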

We apply the reduction principle to analyze different classes of random design matrices. This analysis is
reduced to checking that the almost isometry property holds for all vectors from some low-dimensional
subspaces, which is easier than checking the RE property directly.

The first example is the matrix $\Psi$ whose rows are independent isotropic vectors with subgaussian
marginals as in Definition 1.5. This result extends a theorem of Raskutti et al. (2010) to a non-Gaussian
setting, in which the entries of the design matrix may not even have a density.
Definition 1.5. Let $Y$ be a random vector in $\mathbb{R}^p$.

1. $Y$ is called isotropic if for every $y \in \mathbb{R}^p$, $\mathbb{E}\,|\langle Y, y \rangle|^2 = \|y\|_2^2$.

2. $Y$ is $\psi_2$ with a constant $\alpha$ if for every $y \in \mathbb{R}^p$,

$$\|\langle Y, y \rangle\|_{\psi_2} := \inf\{t : \mathbb{E} \exp(\langle Y, y \rangle^2 / t^2) \le 2\} \le \alpha \|y\|_2. \tag{1.13}$$

The $\psi_2$ condition on a scalar random variable $V$ is equivalent to the subgaussian tail decay of $V$,
which means

$$\mathbb{P}(|V| > t) \le 2\exp(-t^2/c^2), \quad \text{for all } t > 0.$$
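
For a scalar random variable with finitely many values, the $\psi_2$ norm in the sense of the infimum above can be approximated by bisection, since $t \mapsto \mathbb{E}\exp(V^2/t^2)$ is decreasing in $t$. The sketch below is only an illustration under that assumption; `psi2_norm` is a hypothetical helper, not something defined in the paper.

```python
import math


def psi2_norm(values, probs, iters=200):
    """Approximate inf{t > 0 : E exp(V^2 / t^2) <= 2} for a finite-valued V."""
    def moment(t):
        total = 0.0
        for v, pr in zip(values, probs):
            a = (v / t) ** 2
            if a > 700.0:            # avoid overflow; treat the constraint as failing
                return math.inf
            total += pr * math.exp(a)
        return total

    hi = max(abs(v) for v in values)
    if hi == 0:
        return 0.0
    while moment(hi) > 2.0:          # grow until the constraint holds
        hi *= 2.0
    lo = hi
    while moment(lo) <= 2.0:         # shrink until the constraint fails
        lo /= 2.0
    for _ in range(iters):           # bisection between a failing and a passing t
        mid = 0.5 * (lo + hi)
        if moment(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi


# A symmetric +/-1 (Bernoulli) variable: E exp(1/t^2) = 2 gives t = 1/sqrt(log 2).
rademacher_norm = psi2_norm([-1.0, 1.0], [0.5, 0.5])
```

For the symmetric $\pm 1$ variable the infimum can be computed by hand, $t = 1/\sqrt{\log 2}$, which gives a simple check of the routine.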

Throughout this paper, we use $\psi_2$ vector, vector with subgaussian marginals, and subgaussian vector
interchangeably. Examples of isotropic random vectors with subgaussian marginals are:

• The random vector $Y$ with i.i.d. $N(0, 1)$ random coordinates.

• Discrete Gaussian vector, which is a random vector taking values on the integer lattice $\mathbb{Z}^p$ with
distribution $\mathbb{P}(X = m) = C\exp(-\|m\|_2^2/2)$ for $m \in \mathbb{Z}^p$.

• A vector with independent centered bounded random coordinates. The subgaussian property here
follows from the Hoeffding inequality for sums of independent random variables. This example
includes, in particular, vectors with random Bernoulli coordinates, in other words, random vertices of
the discrete cube.
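
The isotropy of the last example can be verified exactly in small dimension. The sketch below enumerates all vertices of the cube $\{-1, +1\}^p$ and computes $\mathbb{E}\langle Y, y\rangle^2$ for $Y$ uniform on the vertices; the function name is hypothetical and the enumeration is feasible only for small $p$.

```python
from itertools import product


def second_moment_bernoulli(y):
    """E <Y, y>^2 for Y uniform on the vertices {-1, +1}^p, by exact enumeration."""
    p = len(y)
    total = 0
    for signs in product((-1, 1), repeat=p):
        total += sum(s * c for s, c in zip(signs, y)) ** 2
    return total / 2 ** p


y = [3, -1, 2, 0]
moment = second_moment_bernoulli(y)        # equals ||y||_2^2 = 9 + 1 + 4 + 0 = 14
squared_norm = sum(c * c for c in y)
```

The cross terms cancel in the average, so the second moment equals $\|y\|_2^2$ for every $y$, which is the isotropy condition in Definition 1.5.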

It is hard to argue that such multivariate Gaussian or Bernoulli random designs are not relevant for statistical 
applications. 

Theorem 1.6. Set $0 < \delta < 1$, $k_0 > 0$, and $0 < s_0 < p$. Let $A$ be a $q \times p$ matrix satisfying the
$\mathrm{RE}(s_0, 3k_0, A)$ condition as in Definition 1.1. Let $d$ be as defined in (1.11), and let
$m = \min(d, p)$. Let $\Psi$ be an $n \times q$ matrix whose rows are independent isotropic $\psi_2$ random
vectors in $\mathbb{R}^q$ with constant $\alpha$. Suppose the sample size satisfies

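Then with probability at least $1 - 2\exp(-\delta^2 n / 2000\alpha^4)$, the
$\mathrm{RE}(s_0, k_0, (1/\sqrt{n})\Psi A)$ condition holds for matrix $(1/\sqrt{n})\Psi A$ with

$$0 < K(s_0, k_0, (1/\sqrt{n})\Psi A) \le \frac{K(s_0, k_0, A)}{1 - \delta}. \tag{1.15}$$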

Remark 1.7. We note that all constants in Theorem 1.6 are explicit, although they are not optimized. 

Theorem 1.6 is applicable in various contexts. We describe two examples. The first example concerns cases
which have been considered in Raskutti et al. (2010); Zhou (2009a). They show that the RE condition on the
covariance matrix $\Sigma$ implies a similar condition on a random design matrix $X = \Psi\Sigma^{1/2}$ with
high probability when $n$ is sufficiently large. In particular, in Zhou (2009a), the author considered
subgaussian random matrices of the form $X = \Psi\Sigma^{1/2}$ where $\Sigma$ is a $p \times p$ positive
semidefinite matrix satisfying the $\mathrm{RE}(s_0, k_0, \Sigma^{1/2})$ condition, and $\Psi$ is as in Theorem
1.6. Unlike the current paper, the author allowed $\rho_{\max}(s_0, \Sigma^{1/2})$ as well as
$K^2(s_0, k_0, \Sigma^{1/2})$ to appear in the lower bound on $n$, and showed that $X/\sqrt{n}$ satisfies the RE
condition as in (1.15) with overwhelming probability whenever

n > ^^(2 + kofK^iso, ko, S^/^) niin(4pmax(so, ^'^^^)so log(5ep/so), so logp) (1.16) 



6 



where the first term was given in Zhou (2009b, Theorem 1.6) explicitly, and the second term is an easy
consequence of combining arguments in Zhou (2009b) and Raskutti et al. (2010). The analysis there relied
crucially on Corollary 2.7 in Mendelson et al. (2007).

In the present work, we get rid of the dependency of the sample size on $\rho_{\max}(s_0, \Sigma^{1/2})$,
although under a slightly stronger $\mathrm{RE}(s_0, 3k_0, \Sigma^{1/2})$ (see also Remark 2.6). More precisely,
let $\Sigma$ be a $p \times p$ covariance matrix satisfying the $\mathrm{RE}(s_0, 3k_0, \Sigma^{1/2})$ condition.
Then, (1.15) implies that with probability at least $1 - 2\exp(-\delta^2 n / 2000\alpha^4)$,

$$0 < K(s_0, k_0, (1/\sqrt{n})\Psi\Sigma^{1/2}) \le \frac{K(s_0, k_0, \Sigma^{1/2})}{1 - \delta}, \tag{1.17}$$

where $n$ satisfies (1.14) for $d$ defined in (1.11), with $A$ replaced by $\Sigma^{1/2}$.

Another application of Theorem 1.6 is given in Zhou et al. (2009a). The $q \times p$ matrix $A$ can be taken as
a data matrix with $p$ attributes (e.g., weight, height, age, etc.), and $q$ individual records. The data are
compressed by a random linear transformation $X = \Psi A$. Such transformations have been called "matrix
masking" in the privacy literature (Duncan and Pearson, 1991). We think of $X$ as "public," while $\Psi$,
which is an $n \times q$ random matrix, is private and only needed at the time of compression. However, even
with $\Psi$ known, recovering $A$ from $X$ requires solving a highly under-determined linear system and comes
with information theoretic privacy guarantees when $n \ll q$, as demonstrated in Zhou et al. (2009a). On the
other hand, sparse recovery using $X$ is highly feasible given that the RE conditions are guaranteed to hold
by Theorem 1.6 with a small $n$. We refer to Zhou et al. (2009a) for a detailed setup on regression using
compressed data as in (1.10).

The second application of the reduction principle is to design matrices with uniformly bounded entries.
As we mentioned above, if the entries of such a matrix are independent, then its rows are subgaussian.
However, the independence of entries is not assumed, so the decay of the marginals can be arbitrarily slow.
A natural example for compressed sensing would be measurements of random Fourier coefficients, when
some of the coefficients cannot be measured.

Theorem 1.8. Let $0 < \delta < 1$ and $0 < s_0 < p$. Let $Y \in \mathbb{R}^p$ be a random vector such that
$\|Y\|_\infty \le M$ a.s. and denote $\Sigma = \mathbb{E}\, YY^T$. Let $X$ be an $n \times p$ matrix, whose rows
$X_1, \ldots, X_n$ are independent copies of $Y$. Let $\Sigma$ satisfy the $\mathrm{RE}(s_0, 3k_0, \Sigma^{1/2})$
condition as in Definition 1.1. Let $d$ be as defined in (1.11), where we replace $A$ with $\Sigma^{1/2}$.
Assume that $d \le p$ and $\rho = \rho_{\min}(d, \Sigma^{1/2}) > 0$. Suppose the sample size satisfies for
some absolute constant $C$

$$n \ge \frac{C M^2 d \cdot \log p}{\rho\delta^2} \cdot \log^3\left( \frac{C M^2 d \cdot \log p}{\rho\delta^2} \right).$$

Then with probability at least $1 - \exp(-\delta\rho n/(6M^2 d))$, the $\mathrm{RE}(s_0, k_0, X)$ condition
holds for matrix $X/\sqrt{n}$ with $0 < K(s_0, k_0, X/\sqrt{n}) \le K(s_0, k_0, \Sigma^{1/2})/(1 - \delta)$.

Remark 1.9. Note that unlike the case of a random matrix with subgaussian marginals, the estimate of
Theorem 1.8 contains the minimal sparse singular value $\rho$. We will provide an example illustrating that
this is necessary in Remark 4.4.

We will prove Theorems 1.3, 1.6, and 1.8 in Sections 2, 3, and 4 respectively.

We note that the reduction principle can be applied to other types of random variables. One can consider the
case of heavy-tailed marginals. In this case the estimate for the images of sparse vectors can be proved using
the technique developed by Vershynin (2011a,b). One can also consider random vectors with log-concave
densities, and obtain similar estimates following the methods of Adamczak et al. (2009, 2011). We leave
the details for an interested reader.

To make our exposition complete, we will show some immediate consequences in terms of statistical
inference on high dimensional data that satisfy such RE and sparse eigenvalue conditions. We discuss in
Section 1.3 some bounds for the Lasso estimator for such a subgaussian random ensemble. In particular,
bounds developed in the present paper can be applied to obtain tight convergence results for covariance
estimation for a multivariate Gaussian model (Zhou et al., 2011).

1.3 Convergence rates in sparse recovery 

The Lasso and the Dantzig selector are both well studied and shown to have provably nice statistical
properties. For results on variable selection, prediction error and $\ell_p$ loss, where $1 \le p \le 2$,
under various incoherence conditions, see, for example, Greenshtein and Ritov (2004); Meinshausen and
Buhlmann (2006); Zhao and Yu (2006); Bunea et al. (2007); Candes and Tao (2007); Koltchinskii (2009a);
van de Geer (2008); Zhang and Huang (2008); Wainwright (2009); Candes and Plan (2009); Bickel et al. (2009);
Cai et al. (2010); Koltchinskii (2009b); Meinshausen and Yu (2009). As mentioned, the restricted eigenvalue
(RE) condition as formulated by Bickel et al. (2009) is among the weakest and hence the most general
conditions in literature imposed on the Gram matrix in order to guarantee nice statistical properties for
the Lasso and the Dantzig selector. For a comprehensive comparison between some of these conditions, we
refer to van de Geer and Buhlmann (2009).

For random design as considered in the present paper, one can show that various oracle inequalities in terms
of $\ell_2$ convergence hold for the Lasso and the Dantzig selector as long as $n$ satisfies the lower bounds
above. Let $s = |\mathrm{supp}\,\beta|$ for $\beta$ in (1.1). Under $\mathrm{RE}(s, 9, \Sigma^{1/2})$, a sample
size of $n = O(s\log(p/s))$ is sufficient for us to derive bounds corresponding to those in Bickel et al.
(2009, Theorem 7.2). As a consequence, we see that this setup requires $O(\log(p/s))$ observations per
nonzero value in $\beta$, where $O(\cdot)$ hides a constant depending on $K^2(s, 9, \Sigma^{1/2})$ for the family
of random matrices with subgaussian marginals which satisfies the $\mathrm{RE}(s, 9, \Sigma^{1/2})$ condition.
Similarly, we note that for a random matrix $X$ with a.s. bounded entries of size $M$,
$n = O(sM^2 \log p \log^3(s\log p))$ samples are sufficient in order to achieve accurate statistical
estimation. We say this is a linear or sublinear sparsity. For $p \gg n$, this is a desirable property as it
implies that accurate statistical estimation is feasible given a very limited amount of data.

As another example, assume that $\rho_{\max}(s, \Sigma^{1/2})$ is a bounded constant and
$\rho_{\min}(s, \Sigma^{1/2}) > 0$. We note that this slight restriction on $\rho_{\max}(s, \Sigma^{1/2})$ allows
one to derive an oracle result on the $\ell_2$ loss as studied by Donoho and Johnstone (1994); Candes and Tao
(2007); Zhou (2009b, 2010), which we now elaborate. Let $\epsilon \sim N(0, \sigma^2 I)$ in (1.1). Assume that
$\mathrm{RE}(s_0, 12, \Sigma^{1/2})$ holds, where $\Sigma_{jj} = 1$, $\forall j$, and $s_0$ is defined as the
smallest integer such that

$$\sum_{i=1}^{p} \min(\beta_i^2, \lambda^2\sigma^2) \le s_0 \lambda^2\sigma^2, \quad \text{where } \lambda = \sqrt{2\log p/n}. \tag{1.18}$$

We note that a consequence of this definition is: $|\beta_j| < \lambda\sigma$ for all $j > s_0$, if we order
$|\beta_1| \ge |\beta_2| \ge \cdots \ge |\beta_p|$; see Candes and Tao (2007). Hence $s_0$ essentially
characterizes the number of significant coefficients of $\beta$ with respect to the noise level $\sigma$.
Following analysis in Zhou (2010), one can show that the Lasso solution satisfies

$$\|\hat\beta - \beta\|_2^2 \asymp s_0 \lambda^2 \sigma^2, \tag{1.19}$$

with overwhelming probability, as long as

$$n \ge C m \log(cp/m), \tag{1.20}$$

where $m = \max(s, d)$ for $d$ as defined in (1.11) with $\Sigma^{1/2}$ replacing $A$.

One can also show the same bounds on $\ell_1$ loss and prediction error as in Zhou (2010) under this setting.
The rate of (1.19) is an obvious improvement upon the rate of $\Theta(\lambda\sigma\sqrt{s})$ when $s_0$ is much
smaller than $s$, that is, when there are many non-zero but small entries in $\beta$. Moreover, given such
ideal rate on the $\ell_2$-loss, it is shown in Zhou (2009b, 2010) that one can then recover a sparse model of
size $\asymp 2s_0$ such that the model contains most of the important variables while achieving such oracle
inequalities as in (1.19), where thresholding of the Lasso estimator followed by refitting has been applied.
Such results have also been used in Gaussian Graphical model selection to show fast convergence rates in
estimating the covariance matrix and its inverse (Zhou et al., 2011).

Conceptually, results in the current paper allow one to extend such oracle results in terms of $\ell_2$ loss
from the family of random matrices obeying the UUP to a broader class of random matrices that satisfy the
RE condition with sample size at essentially the same order. When $\Sigma$ is ill-behaving in the sense that
$\rho_{\max}(m, \Sigma^{1/2})$ grows too rapidly as a function of $m$, we resort to the bound of
$O(\lambda\sigma\sqrt{s})$ which corresponds to those derived in Bickel et al. (2009), under
$\mathrm{RE}(s, 9, \Sigma^{1/2})$.

Finally, the incoherence properties for a random design matrix that is the composition of a random matrix
with a deterministic matrix have been studied even earlier, see for example Rauhut et al. (2008); Zhou et al.
(2009a), in the context of signal reconstruction and high dimensional sparse regressions.



2 Reduction principle 

We first reformulate the reduction principle in the form of restricted isometry: we show that if the matrix
$\tilde\Psi$ acts as almost isometry on the images of the sparse vectors under $A$, then it acts the same way
on the images of a set of vectors which satisfy the cone constraint (1.5). We then prove Theorem 1.3 as a
corollary of Theorem 2.1.

Theorem 2.1. Let $1/5 > \delta > 0$. Let $0 < s_0 < p$ and $k_0 > 0$. Let $A$ be a $q \times p$ matrix such that
the $\mathrm{RE}(s_0, 3k_0, A)$ condition holds for $0 < K(s_0, 3k_0, A) < \infty$. Set

$$d = s_0 + s_0 \max_j \|Ae_j\|_2^2 \left( \frac{16 K^2(s_0, 3k_0, A)(3k_0)^2(3k_0 + 1)}{\delta^2} \right),$$

and let $E = \bigcup_{|J| = d} E_J$ for $d < p$ and $E = \mathbb{R}^p$ otherwise. Let $\tilde\Psi$ be a matrix
such that

$$\forall x \in AE, \quad (1 - \delta)\|x\|_2 \le \|\tilde\Psi x\|_2 \le (1 + \delta)\|x\|_2. \tag{2.1}$$

Then for any $x \in A(\mathrm{Cone}(s_0, k_0)) \cap S^{q-1}$,

$$(1 - 5\delta) \le \|\tilde\Psi x\|_2 \le (1 + 3\delta). \tag{2.2}$$

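Proof of Theorem 1.3. By the $\mathrm{RE}(s_0, 3k_0, A)$ condition, the $\mathrm{RE}(s_0, k_0, A)$ condition
holds as well. Hence for $u \in \mathrm{Cone}(s_0, k_0)$ such that $u \ne 0$,

$$\|Au\|_2 \ge \frac{\|u_{T_0}\|_2}{K(s_0, k_0, A)} > 0,$$

and by (2.2)

$$\|\tilde\Psi A u\|_2 \ge (1 - 5\delta)\|Au\|_2 \ge (1 - 5\delta)\frac{\|u_{T_0}\|_2}{K(s_0, k_0, A)} > 0.$$

□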



The proof of Theorem 2.1 uses several auxiliary results, which will be established in the next two
subsections.



2.1 Preliminary results 



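Our first lemma is based on Maurey's empirical approximation argument (Pisier, 1981). We show that any
vector belonging to the convex hull of many vectors can be approximated by a convex combination of a few
of them.

Lemma 2.2. Let $u_1, \ldots, u_M \in \mathbb{R}^q$. Let $y \in \mathrm{conv}(u_1, \ldots, u_M)$. There exists a
set $L \subset \{1, 2, \ldots, M\}$ such that

$$|L| \le m = \frac{4\max_{j \in \{1,\ldots,M\}} \|u_j\|_2^2}{\varepsilon^2}$$

and a vector $y' \in \mathrm{conv}(u_j, j \in L)$ such that

$$\|y' - y\|_2 \le \varepsilon.$$

Proof. Assume that

$$y = \sum_{j \in \{1,\ldots,M\}} \alpha_j u_j, \quad \text{where } \alpha_j \ge 0 \text{ and } \sum_j \alpha_j = 1.$$

Let $Y$ be a random vector in $\mathbb{R}^q$ such that

$$\mathbb{P}(Y = u_\ell) = \alpha_\ell, \quad \ell \in \{1, \ldots, M\}.$$

Then

$$\mathbb{E}\, Y = \sum_{\ell \in \{1,\ldots,M\}} \alpha_\ell u_\ell = y.$$

Let $Y_1, \ldots, Y_m$ be independent copies of $Y$ and let $\varepsilon_1, \ldots, \varepsilon_m$ be $\pm 1$ i.i.d.
mean zero Bernoulli random variables, chosen independently of $Y_1, \ldots, Y_m$. By the standard
symmetrization argument, we have

$$\mathbb{E}\left\| y - \frac{1}{m}\sum_{j=1}^m Y_j \right\|_2^2 \le 4\,\mathbb{E}\left\| \frac{1}{m}\sum_{j=1}^m \varepsilon_j Y_j \right\|_2^2 = \frac{4}{m^2}\sum_{j=1}^m \mathbb{E}\|Y_j\|_2^2 \le \frac{4\max_{\ell \in \{1,\ldots,M\}} \|u_\ell\|_2^2}{m} \le \varepsilon^2, \tag{2.3}$$

where

$$\mathbb{E}\|Y_j\|_2^2 \le \sup \|Y_j\|_2^2 \le \max_{\ell \in \{1,\ldots,M\}} \|u_\ell\|_2^2$$

and the last inequality in (2.3) follows from the definition of $m$.
Fix a realization $Y_j = u_{k_j}$, $j = 1, \ldots, m$, for which

$$\left\| y - \frac{1}{m}\sum_{j=1}^m Y_j \right\|_2 \le \varepsilon.$$

The vector $\frac{1}{m}\sum_{j=1}^m Y_j$ belongs to the convex hull of $\{u_\ell : \ell \in L\}$, where $L$ is the set
of different elements from the sequence $k_1, \ldots, k_m$. Obviously $|L| \le m$ and the lemma is proved. □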
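
The empirical approximation argument can be simulated directly. The sketch below samples $m = \lceil 4\max_j \|u_j\|_2^2/\varepsilon^2 \rceil$ points according to the convex weights and averages them; the proof guarantees that some realization lies within $\varepsilon$ of $y$, and the retry loop used here to find one is an assumption made for illustration only. The function name and its arguments are hypothetical.

```python
import math
import random


def maurey_approximation(points, weights, eps, rng, max_tries=1000):
    """Sample m indices i.i.d. from `weights`, average the chosen points, and
    retry until the average is within eps of y = sum_j weights[j] * points[j].
    Returns (index_set_L, y_prime, m)."""
    q = len(points[0])
    y = [sum(w * u[c] for w, u in zip(weights, points)) for c in range(q)]
    max_sq = max(sum(c * c for c in u) for u in points)
    m = max(1, math.ceil(4.0 * max_sq / eps ** 2))
    for _ in range(max_tries):
        picks = rng.choices(range(len(points)), weights=weights, k=m)
        y_prime = [sum(points[k][c] for k in picks) / m for c in range(q)]
        if math.dist(y, y_prime) <= eps:
            return sorted(set(picks)), y_prime, m
    raise RuntimeError("no realization found; increase max_tries")


# y is the barycenter of the 50 canonical basis vectors of R^50.
rng = random.Random(0)
points = [[1.0 if i == j else 0.0 for j in range(50)] for i in range(50)]
weights = [1.0 / 50] * 50
L, y_prime, m = maurey_approximation(points, weights, eps=0.5, rng=rng)
```

With unit-norm points and $\varepsilon = 1/2$, at most $m = 16$ of the 50 points are used, independently of the ambient dimension and of the number of points $M$.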

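For each vector $x \in \mathbb{R}^p$, let $T_0$ denote the locations of the $s_0$ largest coefficients of $x$ in
absolute values. Any vector $x \in \mathrm{Cone}(s_0, k_0) \cap S^{p-1}$ satisfies:

$$\|x_{T_0^c}\|_\infty \le \|x_{T_0}\|_1 / s_0 \le \frac{\|x_{T_0}\|_2}{\sqrt{s_0}}, \tag{2.4}$$

$$\|x_{T_0^c}\|_1 \le k_0\sqrt{s_0}\,\|x_{T_0}\|_2 \le k_0\sqrt{s_0}; \quad \text{and} \quad \|x_{T_0^c}\|_2 \le 1. \tag{2.5}$$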
(2.5) 



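The next elementary estimate will be used in conjunction with the RE condition.

Lemma 2.3. For each vector $\upsilon \in \mathrm{Cone}(s_0, k_0)$, let $T_0$ denote the locations of the $s_0$
largest coefficients of $\upsilon$ in absolute values. Then

$$\|\upsilon_{T_0}\|_2 \ge \frac{\|\upsilon\|_2}{\sqrt{1 + k_0}}. \tag{2.6}$$

Proof. By definition of $\mathrm{Cone}(s_0, k_0)$, by (2.4)

$$\|\upsilon_{T_0^c}\|_2^2 \le \|\upsilon_{T_0^c}\|_1 \|\upsilon_{T_0^c}\|_\infty \le k_0 \|\upsilon_{T_0}\|_1 \cdot \|\upsilon_{T_0}\|_1 / s_0 \le k_0 \|\upsilon_{T_0}\|_2^2.$$

Therefore $\|\upsilon\|_2^2 = \|\upsilon_{T_0^c}\|_2^2 + \|\upsilon_{T_0}\|_2^2 \le (k_0 + 1)\|\upsilon_{T_0}\|_2^2$.

□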
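
The estimate of Lemma 2.3 can be checked numerically on a given vector. The sketch below assumes the statement in the form $\|\upsilon_{T_0}\|_2 \ge \|\upsilon\|_2/\sqrt{1 + k_0}$ (the form in which (2.6) is invoked in the proof of Lemma 2.5) and returns both sides for comparison; `check_lemma_2_3` is a hypothetical helper.

```python
import math


def check_lemma_2_3(v, s0, k0):
    """For v in Cone(s0, k0), return (||v_{T0}||_2, ||v||_2 / sqrt(1 + k0)),
    where T0 holds the s0 largest coordinates of v in absolute value."""
    order = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)
    head = [v[j] for j in order[:s0]]
    tail = [v[j] for j in order[s0:]]
    if sum(map(abs, tail)) > k0 * sum(map(abs, head)):
        raise ValueError("v is not in Cone(s0, k0)")
    lhs = math.sqrt(sum(c * c for c in head))
    rhs = math.sqrt(sum(c * c for c in v)) / math.sqrt(1.0 + k0)
    return lhs, rhs


lhs, rhs = check_lemma_2_3([3.0, -2.0, 0.5, 0.5, -1.0], s0=2, k0=1.0)
```

For this vector the left side is $\sqrt{13}$ and the right side is $\sqrt{14.5/2}$, so the inequality holds with room to spare.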



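The next lemma concerns the extremum of a linear functional on a big circle of a $q$-dimensional sphere. We
consider a line passing through the extreme point, and show that the value of the functional on a point of the
line, which is relatively close to the extreme point, provides a good bound for the extremum.

Lemma 2.4. Let $u, \theta, x \in \mathbb{R}^q$ be vectors such that

1. $\|\theta\|_2 = 1$.

2. $\langle x, \theta \rangle \ne 0$.

3. Vector $u$ is not parallel to $x$.

Define $\phi : \mathbb{R} \to \mathbb{R}$ by:

$$\phi(\lambda) = \frac{\langle x + \lambda u, \theta \rangle}{\|x + \lambda u\|_2}. \tag{2.7}$$

Assume $\phi(\lambda)$ has a local maximum at $0$, then

$$\frac{\langle x + u, \theta \rangle}{\langle x, \theta \rangle} \ge 1 - \frac{\|u\|_2}{\|x\|_2}.$$

Proof. Let $v = x/\|x\|_2$. Also let

$$\theta = \beta v + \gamma t, \quad \text{where } t \perp v,\ \|t\|_2 = 1 \text{ and } \beta^2 + \gamma^2 = 1,\ \beta \ne 0,$$

and

$$u = \eta v + \mu t + s, \quad \text{where } s \perp v \text{ and } s \perp t.$$

Define $f$ by:

$$f(\lambda) = \frac{\lambda}{\|x\|_2 + \lambda\eta}. \tag{2.8}$$

Then

$$\phi(\lambda) = \frac{\langle x + \lambda u, \theta \rangle}{\|x + \lambda u\|_2} = \frac{\langle (\|x\|_2 + \lambda\eta)v + \lambda\mu t + \lambda s,\ \beta v + \gamma t \rangle}{\|(\|x\|_2 + \lambda\eta)v + \lambda\mu t + \lambda s\|_2}$$

$$= \frac{\beta(\|x\|_2 + \lambda\eta) + \lambda\mu\gamma}{\sqrt{(\|x\|_2 + \lambda\eta)^2 + (\lambda\mu)^2 + \lambda^2\|s\|_2^2}} = \frac{\beta + \mu\gamma f(\lambda)}{\sqrt{1 + (\mu^2 + \|s\|_2^2) f^2(\lambda)}}.$$

Since $f(\lambda) = \frac{\lambda}{\|x\|_2} + O(\lambda^2)$ we have $\phi(\lambda) = \beta + \frac{\lambda\mu\gamma}{\|x\|_2} + O(\lambda^2)$
in the neighborhood of $0$. Hence, in order for $\phi(\lambda)$ to have a local maximum at $0$, $\mu$ or $\gamma$
must be $0$. Consider these cases separately.

• First suppose $\gamma = 0$, then $\beta^2 = 1$ and $|\langle x, \theta \rangle| = \|x\|_2$. Hence,

$$\frac{\langle x + u, \theta \rangle}{\langle x, \theta \rangle} = 1 + \frac{\langle u, \theta \rangle}{\langle x, \theta \rangle} \ge 1 - \frac{|\langle u, \theta \rangle|}{|\langle x, \theta \rangle|} \ge 1 - \frac{\|u\|_2}{\|x\|_2},$$

where $|\langle u, \theta \rangle| \le \|u\|_2$.

• Otherwise, suppose that $\mu = 0$. Then we have $|\eta| = |\langle u, v \rangle| \le \|u\|_2$ and

$$\frac{\langle x + u, \theta \rangle}{\langle x, \theta \rangle} = 1 + \frac{\langle \eta v + s,\ \beta v + \gamma t \rangle}{\langle v\|x\|_2,\ \beta v + \gamma t \rangle} = 1 + \frac{\eta\beta}{\|x\|_2\beta} = 1 + \frac{\eta}{\|x\|_2} \ge 1 - \frac{\|u\|_2}{\|x\|_2},$$

where we used the fact that $\beta \ne 0$ given $\langle x, \theta \rangle \ne 0$.

□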



2.2 Convex hull of sparse vectors 

For a set $J \subset \{1, \ldots, p\}$, denote $E_J = \mathrm{span}\{e_j : j \in J\}$. In order to prove the
restricted isometry property of $\tilde\Psi$ over the set of vectors in
$A(\mathrm{Cone}(s_0, k_0)) \cap S^{q-1}$, we first show that this set is contained in the convex hull of the
images of the sparse vectors with norms not exceeding $(1 - \delta)^{-1}$. More precisely, we prove the
following lemma.

Lemma 2.5. Let $1 > \delta > 0$. Let $0 < s_0 < p$ and $k_0 > 0$. Let $A$ be a $q \times p$ matrix such that the
$\mathrm{RE}(s_0, k_0, A)$ condition holds for $0 < K(s_0, k_0, A) < \infty$. Define

$$d = d(k_0, A) = s_0 + s_0 \max_j \|Ae_j\|_2^2 \left( \frac{16 K^2(s_0, k_0, A) k_0^2 (k_0 + 1)}{\delta^2} \right). \tag{2.9}$$

Then

$$A(\mathrm{Cone}(s_0, k_0)) \cap S^{q-1} \subset (1 - \delta)^{-1}\, \mathrm{conv}\left( \bigcup_{|J| \le d} AE_J \cap S^{q-1} \right), \tag{2.10}$$

where for $d \ge p$, $E_J$ is understood to be $\mathbb{R}^p$.



Proof. Without loss of generality, assume that d(/co, A) < p, otherwise the lemma is vacuously true. For 
each vector x G M*', let To denote the locations of the sq largest coefficients of x in absolute values. 
Decompose a vector x G Cone(so, ^0) H S"^"^ as 

X = XTo + XT^ G XTo + ko lla^Tolli absconv(eo | j G Tq), where IIxtqIU > ^^^= by (2.6) 

Vko + 1 

and hence 

Ax G AxTa + ^0 Ikrolli absconv(^ej | j G Tg). 

Since the set $A\,\mathrm{Cone}(s_0, k_0) \cap S^{q-1}$ is not easy to analyze, we introduce a set of a simpler structure instead. Define

\[
V = \big\{ x_{T_0} + k_0 \|x_{T_0}\|_1\, \mathrm{absconv}(e_j \mid j \in T_0^c) \;\big|\; x \in \mathrm{Cone}(s_0, k_0) \cap S^{p-1} \big\}.
\]

For a given $x \in \mathrm{Cone}(s_0, k_0) \cap S^{p-1}$, if $T_0$ is not uniquely defined, we include all possible sets $T_0$ in the definition of $V$. Clearly $V \subset \mathrm{Cone}(s_0, k_0)$ is a compact set. Moreover, $V$ contains a base of $\mathrm{Cone}(s_0, k_0)$, that is, for any $y \in \mathrm{Cone}(s_0, k_0) \setminus \{0\}$ there exists $\lambda > 0$ such that $\lambda y \in V$.

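For any $v \in \mathbb{R}^p$ such that $\|Av\|_2 \ne 0$, define

\[
F(v) = \frac{Av}{\|Av\|_2}.
\]

By condition RE$(s_0, k_0, A)$, the function $F$ is well-defined and continuous on $\mathrm{Cone}(s_0, k_0) \setminus \{0\}$, and, in particular, on $V$. Hence,

\[
A\,\mathrm{Cone}(s_0, k_0) \cap S^{q-1} = F(\mathrm{Cone}(s_0, k_0) \setminus \{0\}) = F(V).
\]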

By duality, inclusion (2.10) can be derived from the fact that the supremum of any linear functional over the left side of (2.10) does not exceed the supremum over the right side of it. By the equality above, it is enough to show that for any $\theta \in S^{q-1}$, there exists $z' \in \mathbb{R}^p \setminus \{0\}$ such that $|\mathrm{supp}(z')| \le d$ and $F(z')$ is well defined, which satisfies

\[
\max_{v \in V} \langle F(v), \theta\rangle \le (1-\delta)^{-1} \langle F(z'), \theta\rangle. \tag{2.11}
\]

For a given $\theta$, we construct a $d$-sparse vector $z'$ which satisfies (2.11). Let

\[
z := \arg\max_{v \in V} \langle F(v), \theta\rangle.
\]

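By definition of $V$ there exists $I \subset \{1, \ldots, p\}$ such that $|I| = s_0$, and for some $\epsilon_j \in \{1, -1\}$,

\[
z = z_I + \|z_I\|_1 k_0 \sum_{j \in I^c} \alpha_j \epsilon_j e_j, \quad \text{where } \alpha_j \in [0, 1],\ \sum_{j \in I^c} \alpha_j \le 1, \text{ and } 1 \ge \|z_I\|_2 \ge \frac{1}{\sqrt{k_0 + 1}}. \tag{2.12}
\]

Note that if $\alpha_i = 1$ for some $i \in I^c$, then $z$ is a sparse vector itself, and we can set $z' = z$ in order for (2.11) to hold. We proceed assuming $\alpha_i \in [0, 1)$ for all $i \in I^c$ in (2.12) from now on, in which case we construct a required sparse vector $z'$ via Lemma 2.2. To satisfy the assumptions of this lemma, denote $e_{p+1} = 0$, $\epsilon_{p+1} = 1$ and set

\[
\alpha_{p+1} = 1 - \sum_{j \in I^c} \alpha_j, \quad \text{hence } \alpha_{p+1} \in [0, 1].
\]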

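Let

\[
y := Az_{I^c} = \|z_I\|_1 k_0 \sum_{j \in I^c} \alpha_j \epsilon_j Ae_j = \|z_I\|_1 k_0 \sum_{j \in I^c \cup \{p+1\}} \alpha_j \epsilon_j Ae_j
\]

and denote $\mathcal{M} := \{j \in I^c \cup \{p+1\} : \alpha_j > 0\}$. Let $\varepsilon > 0$ be specified later. Applying Lemma 2.2 with vectors $u_j = k_0 \|z_I\|_1 \epsilon_j Ae_j$ for $j \in \mathcal{M}$, construct a set $J' \subset \mathcal{M}$ satisfying

\[
|J'| \le m := \frac{4 \max_{j \in I^c} k_0^2 \|z_I\|_1^2 \|Ae_j\|_2^2}{\varepsilon^2} \le \frac{4 k_0^2 s_0 \max_{j \in I^c} \|Ae_j\|_2^2}{\varepsilon^2} \tag{2.13}
\]

and a vector

\[
y' = k_0 \|z_I\|_1 \sum_{j \in J'} \beta_j \epsilon_j Ae_j, \quad \text{where for } J' \subset \mathcal{M},\ \beta_j \in [0, 1] \text{ and } \sum_{j \in J'} \beta_j = 1,
\]

such that $\|y' - y\|_2 \le \varepsilon$.

Set $u := k_0 \|z_I\|_1 \sum_{j \in J'} \beta_j \epsilon_j e_j$ and let

\[
z' = z_I + u.
\]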

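By construction, $Az' \in AE_J$, where $J := (I \cup J') \cap \{1, \ldots, p\}$ and

\[
|J| \le |I| + |J'| \le s_0 + m. \tag{2.14}
\]

Furthermore, we have

\[
\|Az - Az'\|_2 = \|A(z_{I^c} - u)\|_2 = \|y - y'\|_2 \le \varepsilon.
\]

For $\{\beta_j, j \in J'\}$ as above, we extend it to $\{\beta_j, j \in I^c \cup \{p+1\}\}$ by setting $\beta_j = 0$ for all $j \in (I^c \cup \{p+1\}) \setminus J'$ and write

\[
z' = z_I + k_0 \|z_I\|_1 \sum_{j \in I^c \cup \{p+1\}} \beta_j \epsilon_j e_j, \quad \text{where } \beta_j \in [0, 1] \text{ and } \sum_{j \in I^c \cup \{p+1\}} \beta_j = 1.
\]

If $z' = z$, we are done. Otherwise, for some $\lambda$ to be specified, consider the vector

\[
z + \lambda(z' - z) = z_I + k_0 \|z_I\|_1 \sum_{j \in I^c \cup \{p+1\}} [(1-\lambda)\alpha_j + \lambda\beta_j]\, \epsilon_j e_j.
\]

We have $\sum_{j \in I^c \cup \{p+1\}} [(1-\lambda)\alpha_j + \lambda\beta_j] = 1$ and

\[
\exists\, \delta_0 > 0 \text{ s.t. } \forall j \in I^c \cup \{p+1\},\ (1-\lambda)\alpha_j + \lambda\beta_j \in [0, 1] \text{ if } |\lambda| < \delta_0.
\]

To see this, we note that

• This condition holds by continuity for all $j$ such that $\alpha_j \in (0, 1)$.

• If $\alpha_j = 0$ for some $j$, then $\beta_j = 0$ by construction.

Thus $\sum_{j \in I^c} [(1-\lambda)\alpha_j + \lambda\beta_j] \le 1$ and $z + \lambda(z' - z) = z_I + k_0 \|z_I\|_1 \sum_{j \in I^c} [(1-\lambda)\alpha_j + \lambda\beta_j]\, \epsilon_j e_j \in V$ whenever $|\lambda| < \delta_0$.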

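Consider now a function $\phi : (-\delta_0, \delta_0) \to \mathbb{R}$,

\[
\phi(\lambda) := \langle F(z + \lambda(z' - z)), \theta\rangle = \frac{\langle Az + \lambda(Az' - Az), \theta\rangle}{\|Az + \lambda(Az' - Az)\|_2}.
\]

Since $z$ maximizes $\langle F(v), \theta\rangle$ over all $v \in V$, $\phi(\lambda)$ attains a local maximum at $0$. Then by Lemma 2.4, we have

\[
\frac{\langle Az', \theta\rangle}{\langle Az, \theta\rangle} = \frac{\langle Az + (Az' - Az), \theta\rangle}{\langle Az, \theta\rangle} \ge 1 - \frac{\|Az' - Az\|_2}{\|Az\|_2} = \frac{\|Az\|_2 - \|Az' - Az\|_2}{\|Az\|_2},
\]

hence

\[
\begin{aligned}
\frac{\langle F(z'), \theta\rangle}{\langle F(z), \theta\rangle}
&= \frac{\langle Az'/\|Az'\|_2, \theta\rangle}{\langle Az/\|Az\|_2, \theta\rangle}
= \frac{\|Az\|_2}{\|Az'\|_2} \times \frac{\langle Az', \theta\rangle}{\langle Az, \theta\rangle} \\
&\ge \frac{\|Az\|_2}{\|Az\|_2 + \|Az' - Az\|_2} \times \frac{\|Az\|_2 - \|Az' - Az\|_2}{\|Az\|_2} \\
&= \frac{\|Az\|_2 - \|Az' - Az\|_2}{\|Az\|_2 + \|Az' - Az\|_2}
= 1 - \frac{2\|Az' - Az\|_2}{\|Az\|_2 + \|Az' - Az\|_2}
\ge 1 - \frac{2\varepsilon}{\|Az\|_2 + \varepsilon}.
\end{aligned}
\]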

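By definition, $z \in \mathrm{Cone}(s_0, k_0)$. Hence we apply the RE$(s_0, k_0, A)$ condition and (2.12) to obtain

\[
\|Az\|_2 \ge \frac{\|z_I\|_2}{K(s_0, k_0, A)} \ge \frac{1}{\sqrt{1 + k_0}\, K(s_0, k_0, A)}.
\]

Now we can set $\varepsilon = \dfrac{\delta}{2\sqrt{1 + k_0}\, K(s_0, k_0, A)}$, which yields

\[
\frac{\langle F(z'), \theta\rangle}{\langle F(z), \theta\rangle} \ge 1 - \delta, \tag{2.15}
\]

and thus (2.11) holds. Finally, by (2.13), we have

\[
m \le s_0 \max_j \|Ae_j\|_2^2 \left( \frac{16 K^2(s_0, k_0, A)\, k_0^2 (k_0 + 1)}{\delta^2} \right),
\]

and hence the inclusion (2.10) holds in view of (2.14) and (2.15).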



□ 
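The sparsification step that the proof takes from Lemma 2.2 can be illustrated numerically. The sketch below assumes that Lemma 2.2 is the usual empirical (Maurey-type) approximation: a convex combination of vectors u_j is replaced by the average of m = 4 max_j ||u_j||_2^2 / eps^2 independent draws from the weights. The function name `maurey_sparsify` and the retry loop are illustrative choices, not part of the paper.

```python
import numpy as np

def maurey_sparsify(U, alpha, eps, rng, max_tries=1000):
    """Approximate y = U @ alpha by a convex combination of at most m columns of U.

    U: (q, M) array whose columns are the vectors u_j.
    alpha: convex weights (nonnegative, summing to one).
    Returns (support, beta, m) with ||U @ beta - U @ alpha||_2 <= eps.
    """
    y = U @ alpha
    m = int(np.ceil(4 * np.max(np.sum(U**2, axis=0)) / eps**2))
    for _ in range(max_tries):
        # Draw m indices i.i.d. from alpha and average the selected columns.
        draws = rng.choice(U.shape[1], size=m, p=alpha)
        beta = np.bincount(draws, minlength=U.shape[1]) / m
        if np.linalg.norm(U @ beta - y) <= eps:
            return np.flatnonzero(beta), beta, m
    raise RuntimeError("no draw met the tolerance")
```

The expected squared error of a single draw is at most max_j ||u_j||_2^2 / m, so by Markov's inequality each attempt succeeds with probability at least 3/4; the loop is only a convenience for turning the existence statement into a concrete vector.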



2.3 Proof of the reduction principle 

To prove the restricted isomorphism condition (2.2), we apply Lemma 2.5 with $k_0$ replaced by $3k_0$. The upper bound in (2.2) follows immediately from the lemma. To prove the lower bound, we consider a vector $x \in \mathrm{Cone}(s_0, k_0)$ as an endpoint of an interval whose midpoint is a sparse vector from the same cone. Then the other endpoint of the interval will be contained in the larger cone $\mathrm{Cone}(s_0, 3k_0)$. Comparing the upper estimate for the norm of the image of this endpoint with the lower estimate for the midpoint will yield the required lower estimate for the point $x$.

Proof of Theorem 2.1. Let $v \in \mathrm{Cone}(s_0, 3k_0) \setminus \{0\}$, and so $\|Av\|_2 > 0$ by the RE$(s_0, 3k_0, A)$ condition. Let $d(3k_0, A)$ be defined as in (2.9). As in the proof of Lemma 2.5, we may assume that $d(3k_0, A) < p$. By Lemma 2.5, applied with $k_0$ replaced by $3k_0$, we have

\[
\frac{Av}{\|Av\|_2} \in A(\mathrm{Cone}(s_0, 3k_0)) \cap S^{q-1} \subset (1-\delta)^{-1}\, \mathrm{conv}\Big( \bigcup_{|J| = d(3k_0, A)} AE_J \cap S^{q-1} \Big)
\]

and

\[
\frac{\|\tilde\Psi Av\|_2}{\|Av\|_2} \le \frac{1}{1-\delta} \max_{u \in \mathrm{conv}(AE \cap S^{q-1})} \|\tilde\Psi u\|_2 = \frac{1}{1-\delta} \max_{u \in AE \cap S^{q-1}} \|\tilde\Psi u\|_2.
\]

The last equality holds since the maximum of $\|\tilde\Psi u\|_2$ occurs at an extreme point of the set $\mathrm{conv}(AE \cap S^{q-1})$, because of convexity of the function $f(x) = \|\tilde\Psi x\|_2$. Hence, by (2.1),

\[
\forall x \in A(\mathrm{Cone}(s_0, 3k_0)) \cap S^{q-1}, \quad \|\tilde\Psi x\|_2 \le (1+\delta)(1-\delta)^{-1} \le 1 + 3\delta, \tag{2.16}
\]

where the last inequality is satisfied once $\delta < 1/3$, which proves the upper estimate in (2.2).

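We have to prove the opposite inequality. Let $x = x_I + x_{I^c} \in \mathrm{Cone}(s_0, k_0) \cap S^{p-1}$, where the set $I$ contains the locations of the $s_0$ largest coefficients of $x$ in absolute value. We have

\[
x = x_I + \|x_{I^c}\|_1 \sum_{j \in I^c} \frac{|x_j|}{\|x_{I^c}\|_1}\, \mathrm{sgn}(x_j)\, e_j, \quad \text{where } 1 \ge \|x_I\|_2 \ge \frac{1}{\sqrt{k_0 + 1}} \text{ by (2.6)}. \tag{2.17}
\]

Let $\varepsilon > 0$ be specified later. We now construct a $d(3k_0, A)$-sparse vector $y = x_I + u \in \mathrm{Cone}(s_0, k_0)$, where $u$ is supported on $I^c$, which satisfies

\[
\|u\|_1 = \|y_{I^c}\|_1 = \|x_{I^c}\|_1 \quad \text{and} \quad \|Ax - Ay\|_2 = \|A(x_{I^c} - y_{I^c})\|_2 \le \varepsilon. \tag{2.18}
\]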



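To do so, set

\[
w := Ax_{I^c} = \|x_{I^c}\|_1 \sum_{j \in I^c} \frac{|x_j|}{\|x_{I^c}\|_1}\, \mathrm{sgn}(x_j)\, Ae_j.
\]

Let $\mathcal{M} := \{j \in I^c : x_j \ne 0\}$. Applying Lemma 2.2 with vectors $u_j = \|x_{I^c}\|_1\, \mathrm{sgn}(x_j)\, Ae_j$ for $j \in \mathcal{M}$, construct a set $J' \subset \mathcal{M}$ satisfying

\[
|J'| \le m := \frac{4 \max_{j \in \mathcal{M}} \|x_{I^c}\|_1^2 \|Ae_j\|_2^2}{\varepsilon^2} \le \frac{4 k_0^2 s_0 \max_{j \in \mathcal{M}} \|Ae_j\|_2^2}{\varepsilon^2} \tag{2.19}
\]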



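and a vector

\[
w' = \|x_{I^c}\|_1 \sum_{j \in J'} \beta_j\, \mathrm{sgn}(x_j)\, Ae_j, \quad \text{where for } J' \subset \mathcal{M},\ \beta_j \in [0, 1] \text{ and } \sum_{j \in J'} \beta_j = 1,
\]

such that $\|Ax - Ay\|_2 = \|w' - w\|_2 \le \varepsilon$. Set $u := \|x_{I^c}\|_1 \sum_{j \in J'} \beta_j\, \mathrm{sgn}(x_j)\, e_j$ and let

\[
y = x_I + u = x_I + \|x_{I^c}\|_1 \sum_{j \in J'} \beta_j\, \mathrm{sgn}(x_j)\, e_j, \quad \text{where } \beta_j \in [0, 1] \text{ and } \sum_{j \in J'} \beta_j = 1.
\]

By construction, $y \in \mathrm{Cone}(s_0, k_0) \cap E_J$, where $J := I \cup J'$ and

\[
|J| = |I| + |J'| \le s_0 + m. \tag{2.20}
\]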



This, in particular, implies that $\|Ay\|_2 > 0$. Assume that $\varepsilon$ is chosen so that $s_0 + m \le d(3k_0, A)$, and so by (2.1)

\[
\frac{\|\tilde\Psi Ay\|_2}{\|Ay\|_2} \ge 1 - \delta.
\]
Set

\[
v = x_I + 2y_{I^c} - x_{I^c} = y + (y_{I^c} - x_{I^c}). \tag{2.21}
\]

Then (2.18) implies

\[
\|Av\|_2 \le \|Ay\|_2 + \|A(y_{I^c} - x_{I^c})\|_2 \le \|Ay\|_2 + \varepsilon, \tag{2.22}
\]

and $v \in \mathrm{Cone}(s_0, 3k_0)$ as

\[
\|v_{I^c}\|_1 \le 2\|y_{I^c}\|_1 + \|x_{I^c}\|_1 = 3\|x_{I^c}\|_1 \le 3k_0 \|x_I\|_1 = 3k_0 \|v_I\|_1,
\]

where we use the fact that $\|x_{I^c}\|_1 = \|y_{I^c}\|_1$. Hence, by the upper estimate (2.16), we have

\[
\frac{\|\tilde\Psi Av\|_2}{\|Av\|_2} \le (1+\delta)(1-\delta)^{-1}. \tag{2.23}
\]

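Since $y = \frac{1}{2}(x + v)$, where $y_I = x_I$, we have by the lower bound in (2.1) and the triangle inequality,

\[
\begin{aligned}
1 - \delta \le \frac{\|\tilde\Psi Ay\|_2}{\|Ay\|_2}
&\le \frac{1}{2}\, \frac{\|\tilde\Psi Ax\|_2 + \|\tilde\Psi Av\|_2}{\|Ay\|_2} \\
&\le \frac{1}{2} \left( \frac{\|\tilde\Psi Ax\|_2}{\|Ax\|_2} + \frac{\|\tilde\Psi Av\|_2}{\|Av\|_2} \right) \frac{\|Ay\|_2 + \varepsilon}{\|Ay\|_2} \\
&\le \frac{1}{2} \left( \frac{\|\tilde\Psi Ax\|_2}{\|Ax\|_2} + (1+\delta)(1-\delta)^{-1} \right) \frac{\|Ay\|_2 + \varepsilon}{\|Ay\|_2},
\end{aligned}
\]

where in the second line we apply (2.22) and (2.18), and in the third line, (2.23). By the RE$(s_0, k_0, A)$ condition and (2.17) we have

\[
\|Ay\|_2 \ge \frac{\|y_I\|_2}{K(s_0, k_0, A)} = \frac{\|x_I\|_2}{K(s_0, k_0, A)} \ge \frac{1}{K(s_0, k_0, A)\sqrt{k_0 + 1}}.
\]

Set

\[
\varepsilon = \frac{\delta}{6\sqrt{1 + k_0}\, K(s_0, k_0, A)}, \quad \text{so that} \quad \frac{\|Ay\|_2 + \varepsilon}{\|Ay\|_2} \le 1 + \delta/6.
\]

Then for $\delta < 1/5$,

\[
\frac{\|\tilde\Psi Ax\|_2}{\|Ax\|_2} \ge 2\, \frac{1 - \delta}{1 + \delta/6} - (1+\delta)(1-\delta)^{-1} \ge 1 - 5\delta.
\]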
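The two scalar inequalities used in this proof, namely $(1+\delta)(1-\delta)^{-1} \le 1 + 3\delta$ for $\delta \le 1/3$ in (2.16) and the final lower bound for $\delta < 1/5$, are elementary but easy to get wrong by a sign. A short numerical sanity check on a grid follows; the function names `upper_gap` and `lower_gap` are hypothetical and exist only for this illustration.

```python
import numpy as np

def upper_gap(delta):
    """(1 + 3*delta) - (1 + delta)/(1 - delta); nonnegative when delta <= 1/3."""
    return (1 + 3 * delta) - (1 + delta) / (1 - delta)

def lower_gap(delta):
    """2(1 - delta)/(1 + delta/6) - (1 + delta)/(1 - delta) - (1 - 5*delta)."""
    return 2 * (1 - delta) / (1 + delta / 6) - (1 + delta) / (1 - delta) - (1 - 5 * delta)
```

Both gaps vanish as $\delta \to 0$, so a grid check is a sanity test rather than a proof.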



This verifies the lower estimate. It remains to check the bound for the cardinality of $J$. By (2.19) and (2.20), we have for $k_0 > 0$,

\[
|J| \le s_0 + m \le s_0 + s_0 \max_j \|Ae_j\|_2^2 \left( \frac{16 K^2(s_0, k_0, A)(3k_0)^2 (k_0 + 1)}{\delta^2} \right) < d(3k_0, A),
\]

as desired. This completes the proof of Theorem 2.1.

□
Remark 2.6. Let $\varepsilon > 0$. Instead of $v$ defined in (2.21), one can consider the vector

\[
v_\varepsilon = x_I + y_{I^c} - \varepsilon(x_{I^c} - y_{I^c}) \in \mathrm{Cone}(s_0, (1+\varepsilon)k_0).
\]

Then replacing $v$ by $v_\varepsilon$ throughout the proof, we can establish Theorem 2.1 under the assumption RE$(s_0, (1+\varepsilon)k_0, A)$ instead of RE$(s_0, 3k_0, A)$, if we increase the dimension $d(3k_0)$ by a factor depending on $\varepsilon$.



3 Subgaussian random design 

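Theorem 1.6 can be reformulated as an almost isometry condition for the matrix $X = \Psi A$ acting on the set $\mathrm{Cone}(s_0, k_0)$. Recall that

\[
d(3k_0, A) = s_0 + s_0 \max_j \|Ae_j\|_2^2 \left( \frac{16 K^2(s_0, 3k_0, A)(3k_0)^2 (3k_0 + 1)}{\delta^2} \right).
\]

Theorem 3.1. Set $0 < \delta < 1$, $0 < s_0 < p$, and $k_0 > 0$. Let $A$ be a $q \times p$ matrix satisfying the RE$(s_0, 3k_0, A)$ condition as in Definition 1.1. Let $m = \min(d(3k_0, A), p) < p$. Let $\Psi$ be an $n \times q$ matrix whose rows are independent isotropic $\psi_2$ random vectors in $\mathbb{R}^q$ with constant $\alpha$. Assume that the sample size satisfies

\[
n \ge \frac{2000\, m \alpha^4}{\delta^2} \log\left( \frac{60 e p}{m \delta} \right). \tag{3.1}
\]

Then with probability at least $1 - 2\exp(-\delta^2 n / 2000\alpha^4)$, for all $v \in \mathrm{Cone}(s_0, k_0)$ such that $v \ne 0$,

\[
1 - \delta \le \frac{1}{\sqrt{n}} \frac{\|\Psi A v\|_2}{\|Av\|_2} \le 1 + \delta. \tag{3.2}
\]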

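Theorem 1.6 follows immediately from Theorem 3.1. Indeed, by (3.2), for all $u \in \mathrm{Cone}(s_0, k_0)$ such that $u \ne 0$,

\[
\frac{1}{\sqrt{n}} \|\Psi A u\|_2 \ge (1 - \delta) \|Au\|_2 \ge (1 - \delta) \frac{\|u_{T_0}\|_2}{K(s_0, k_0, A)} > 0.
\]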

To derive Theorem 3.1 from Theorem 2.1 we need a lower estimate for the norm of the image of a sparse vector. Such an estimate relies on the standard $\varepsilon$-net argument, similarly to Mendelson et al. (2008, Section 3).

Theorem 3.2. Set $0 < \tau < 1$. Let $A$ be a $q \times p$ matrix, and let $\Psi$ be an $n \times q$ matrix whose rows are independent isotropic $\psi_2$ random vectors in $\mathbb{R}^q$ with constant $\alpha$. For $m \le p$, assume that

\[
n \ge \frac{80\, m \alpha^4}{\tau^2} \log\left( \frac{12 e p}{m \tau} \right). \tag{3.3}
\]

Then with probability at least $1 - 2\exp(-\tau^2 n / 80\alpha^4)$, for all $m$-sparse vectors $u$ in $\mathbb{R}^p$,

\[
(1 - \tau) \|Au\|_2 \le \frac{1}{\sqrt{n}} \|\Psi A u\|_2 \le (1 + \tau) \|Au\|_2. \tag{3.4}
\]
We note that Theorem 3.2 does not require the RE condition to hold. No particular upper bound on $\rho_{\max}(m, A)$ is imposed here either.

We now state a large deviation bound for the $m$-sparse eigenvalues $\rho_{\min}(m, X)$ and $\rho_{\max}(m, X)$ of the random design $X = n^{-1/2}\Psi A$, which follows from Theorem 3.2 directly.

Corollary 3.3. Under the conditions of Theorem 3.2, we have with probability at least $1 - 2\exp(-\tau^2 n / 80\alpha^4)$,

\[
(1 - \tau)\sqrt{\rho_{\min}(m, A)} \le \sqrt{\rho_{\min}(m, X)} \le \sqrt{\rho_{\max}(m, X)} \le (1 + \tau)\sqrt{\rho_{\max}(m, A)}. \tag{3.5}
\]
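The statement of Theorem 3.2 lends itself to a small simulation. The sketch below evaluates the right-hand side of (3.3) and spot-checks the ratio in (3.4) on randomly drawn sparse vectors for a Gaussian $\Psi$. Several things here are assumptions made for illustration: Gaussian rows stand in for general isotropic $\psi_2$ rows, the value of $\alpha$ passed to `sample_size_bound` is a placeholder, and random sampling of supports only probes the bound rather than verifying it uniformly over all sparse vectors.

```python
import numpy as np

def sample_size_bound(m, p, tau, alpha):
    """Right-hand side of (3.3): 80 m alpha^4 / tau^2 * log(12 e p / (m tau))."""
    return 80 * m * alpha**4 / tau**2 * np.log(12 * np.e * p / (m * tau))

def sparse_ratios(Psi, A, m, trials, rng):
    """Ratios ||Psi A u||_2 / (sqrt(n) ||A u||_2) over random m-sparse vectors u."""
    n, p = Psi.shape[0], A.shape[1]
    ratios = np.empty(trials)
    for t in range(trials):
        u = np.zeros(p)
        support = rng.choice(p, size=m, replace=False)
        u[support] = rng.standard_normal(m)
        Au = A @ u
        ratios[t] = np.linalg.norm(Psi @ Au) / (np.sqrt(n) * np.linalg.norm(Au))
    return ratios
```

A typical use is to take $A$ as a square root of a non-trivial covariance matrix, draw $\Psi$ with i.i.d. standard normal entries, and compare the spread of the returned ratios around 1 with the tolerance $\tau$.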



3.1 Proof of Theorem 3.1 



For $n$ as bounded in (3.1), where $m = \min(d(3k_0, A), p)$, condition (3.3) holds with $\tau = \delta/5$. Then by Theorem 3.2, we have with probability at least $1 - 2\exp(-n\delta^2 / (2000\alpha^4))$,

\[
\forall\, m\text{-sparse vectors } u, \quad (1 - \delta/5) \|Au\|_2 \le \frac{1}{\sqrt{n}} \|\Psi A u\|_2 \le (1 + \delta/5) \|Au\|_2.
\]

The proof finishes by application of Theorem 2.1.
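As a sanity check on the constants, here is a short worked computation. It assumes that Theorem 3.2 is applied with $\tau = \delta/5$, as stated above, and that Theorem 2.1 is then applied with $\delta$ replaced by $\delta/5$; the second substitution is our reading of the final step rather than something spelled out in the text.

\[
\frac{80\, m\alpha^4}{\tau^2} = \frac{2000\, m\alpha^4}{\delta^2}, \qquad
\frac{12ep}{m\tau} = \frac{60ep}{m\delta}, \qquad
\frac{\tau^2 n}{80\alpha^4} = \frac{\delta^2 n}{2000\alpha^4},
\]

\[
1 - 5 \cdot \frac{\delta}{5} = 1 - \delta, \qquad 1 + 3 \cdot \frac{\delta}{5} \le 1 + \delta.
\]

The first line matches (3.3) to the sample size and probability appearing with the constant 2000, and the second line turns the bounds $1 - 5\delta$ and $1 + 3\delta$ of (2.2) into the two-sided bound $1 \pm \delta$.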



3.2 Proof of Theorem 3.2 



We start with a definition. 

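Definition 3.4. Given a subset $U \subset \mathbb{R}^p$ and a number $\varepsilon > 0$, an $\varepsilon$-net $\Pi$ of $U$ with respect to the Euclidean metric is a subset of points of $U$ such that the $\varepsilon$-balls centered at the points of $\Pi$ cover $U$:

\[
U \subset \bigcup_{x \in \Pi} (x + \varepsilon B_2^p),
\]

where $A + B := \{a + b : a \in A, b \in B\}$ is the Minkowski sum of the sets $A$ and $B$. The covering number $N(U, \varepsilon)$ is the smallest cardinality of an $\varepsilon$-net of $U$.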

The proof of Theorem 3.2 uses two well-known results. The first one is the volumetric estimate; see 
e.g. Milman and Schechtman (1986). 

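Lemma 3.5. Given $m \ge 1$ and $\varepsilon > 0$, there exists an $\varepsilon$-net $\Pi \subset B_2^m$ of $B_2^m$ with respect to the Euclidean metric such that $B_2^m \subset (1 - \varepsilon)^{-1}\, \mathrm{conv}\, \Pi$ and $|\Pi| \le (1 + 2/\varepsilon)^m$. Similarly, there exists an $\varepsilon$-net of the sphere $S^{m-1}$, $\Pi' \subset S^{m-1}$, such that $|\Pi'| \le (1 + 2/\varepsilon)^m$.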
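The cardinality bound of Lemma 3.5 can be observed on a finite sample of the sphere. The sketch below builds a net greedily, which is one standard way to realize the volumetric estimate; the function name `greedy_net` is hypothetical, and the net it returns covers the given point cloud, not the whole sphere.

```python
import numpy as np

def greedy_net(points, eps):
    """Greedy eps-net of a finite point cloud.

    A point is kept only if it is farther than eps from every point kept so far,
    so the kept points are eps-separated and every input point is within eps
    of some kept point.
    """
    net = [points[0]]
    for x in points[1:]:
        if np.linalg.norm(np.asarray(net) - x, axis=1).min() > eps:
            net.append(x)
    return np.asarray(net)
```

Because the kept points are $\varepsilon$-separated points of the unit ball, the same volume comparison that proves the lemma bounds their number by $(1 + 2/\varepsilon)^m$.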

The second lemma with a worse constant can be derived from Bernstein's inequality for subexponential 
random variables. Since we are interested in the numerical value of the constant, we provide a proof below. 
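Lemma 3.6. Let $Y_1, \ldots, Y_n$ be independent random variables such that $\mathbb{E}Y_j^2 = 1$ and $\|Y_j\|_{\psi_2} \le \alpha$ for all $j = 1, \ldots, n$. Then for any $\theta \in (0, 1)$

\[
\mathbb{P}\left( \left| \frac{1}{n} \sum_{j=1}^n Y_j^2 - 1 \right| > \theta \right) \le 2\exp\left( -\frac{\theta^2 n}{10\alpha^4} \right).
\]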
For a set $J \subset \{1, \ldots, p\}$, denote $E_J = \mathrm{span}\{e_j : j \in J\}$, and set $F_J = AE_J$. For each subset $F_J \cap S^{q-1}$, construct an $\varepsilon$-net $\Pi_J$, which satisfies

\[
\Pi_J \subset F_J \cap S^{q-1} \quad \text{and} \quad |\Pi_J| \le (1 + 2/\varepsilon)^m.
\]

The existence of such $\Pi_J$ is guaranteed by Lemma 3.5. If

\[
\Pi = \bigcup_{|J| = m} \Pi_J,
\]

then the previous estimate implies

\[
|\Pi| \le (3/\varepsilon)^m \binom{p}{m} \le \left( \frac{3ep}{m\varepsilon} \right)^m = \exp\left( m \log\left( \frac{3ep}{m\varepsilon} \right) \right).
\]
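The counting step above combines the per-subspace net size with the number of supports. A minimal sketch of this arithmetic follows, assuming $\varepsilon \le 1$ so that $1 + 2/\varepsilon \le 3/\varepsilon$, and using the standard bound $\binom{p}{m} \le (ep/m)^m$. The function name `net_cardinality_bounds` is hypothetical.

```python
from math import comb, e, log

def net_cardinality_bounds(p, m, eps):
    """Return (exact, relaxed) upper bounds on the size of the union of nets.

    exact:   number of supports times the per-subspace net size (1 + 2/eps)^m.
    relaxed: the closed form (3 e p / (m eps))^m used in the union bound.
    """
    exact = comb(p, m) * (1 + 2 / eps) ** m
    relaxed = (3 * e * p / (m * eps)) ** m
    return exact, relaxed
```

The logarithm of the relaxed bound is $m \log(3ep/(m\varepsilon))$, which is the quantity that the sample size has to dominate in the union bound.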

For $y \in S^{q-1} \cap F_J \subset F$, let $\pi(y)$ be one of the closest points in the $\varepsilon$-cover $\Pi_J$. Then

\[
\frac{y - \pi(y)}{\|y - \pi(y)\|_2} \in F_J \cap S^{q-1}, \quad \text{where } \|y - \pi(y)\|_2 \le \varepsilon.
\]



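Denote by $\Psi_1, \ldots, \Psi_n$ the rows of the matrix $\Psi$, and set $\Gamma = n^{-1/2}\Psi$. Let $x \in S^{q-1}$. Applying Lemma 3.6 to the random variables $\langle \Psi_1, x\rangle, \ldots, \langle \Psi_n, x\rangle$, we have that for every $\theta < 1$

\[
\mathbb{P}\left( \left| \|\Gamma x\|_2^2 - 1 \right| > \theta \right) = \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n \langle \Psi_i, x\rangle^2 - 1 \right| > \theta \right) \le 2\exp\left( -\frac{n\theta^2}{10\alpha^4} \right). \tag{3.6}
\]

For

\[
n \ge \frac{20\, m\alpha^4}{\theta^2} \log\left( \frac{3ep}{m\varepsilon} \right),
\]

the union bound implies

\[
\mathbb{P}\left( \exists x \in \Pi \text{ s.t. } \left| \|\Gamma x\|_2^2 - 1 \right| > \theta \right) \le 2|\Pi| \exp\left( -\frac{n\theta^2}{10\alpha^4} \right) \le 2\exp\left( -\frac{n\theta^2}{20\alpha^4} \right).
\]

Then for all $y_0 \in \Pi$,

\[
1 - \theta \le \|\Gamma y_0\|_2^2 \le 1 + \theta, \quad \text{and so} \quad 1 - \theta \le \|\Gamma y_0\|_2 \le 1 + \frac{\theta}{2},
\]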



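with probability at least $1 - 2\exp\left( -\frac{n\theta^2}{20\alpha^4} \right)$. The bound over the entire $S^{q-1} \cap F_J$ is obtained by approximation. We have

\[
\|\Gamma\pi(y)\|_2 - \|\Gamma(y - \pi(y))\|_2 \le \|\Gamma y\|_2 \le \|\Gamma\pi(y)\|_2 + \|\Gamma(y - \pi(y))\|_2. \tag{3.7}
\]

Define

\[
\|\Gamma\|_{2, F_J} := \sup_{y \in S^{q-1} \cap F_J} \|\Gamma y\|_2.
\]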
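The RHS of (3.7) is upper bounded by $1 + \frac{\theta}{2} + \varepsilon \|\Gamma\|_{2, F_J}$. By taking the supremum over all $y \in S^{q-1} \cap F_J$, we have

\[
\|\Gamma\|_{2, F_J} \le 1 + \frac{\theta}{2} + \varepsilon \|\Gamma\|_{2, F_J}, \quad \text{and hence} \quad \|\Gamma\|_{2, F_J} \le \frac{1 + \theta/2}{1 - \varepsilon}.
\]

The LHS of (3.7) is lower bounded by $1 - \theta - \varepsilon \|\Gamma\|_{2, F_J}$, and hence for all $y \in S^{q-1} \cap F_J$

\[
\|\Gamma y\|_2 \ge 1 - \theta - \varepsilon \|\Gamma\|_{2, F_J} \ge 1 - \theta - \frac{\varepsilon(1 + \theta/2)}{1 - \varepsilon}.
\]

Putting these together, we have for all $y \in S^{q-1} \cap F_J$

\[
1 - \theta - \frac{\varepsilon(1 + \theta/2)}{1 - \varepsilon} \le \|\Gamma y\|_2 \le \frac{1 + \theta/2}{1 - \varepsilon},
\]

which holds for all sets $J$. Thus for $\theta < 1/2$ and $\varepsilon = \frac{\theta}{1 + 2\theta}$,

\[
1 - 2\theta < \|\Gamma y\|_2 < 1 + 2\theta.
\]

For any $m$-sparse vector $u \in S^{p-1}$,

\[
\frac{Au}{\|Au\|_2} \in F_J \quad \text{for } J = \mathrm{supp}(u),
\]

and so

\[
(1 - 2\theta) \|Au\|_2 \le \|\Gamma A u\|_2 \le (1 + 2\theta) \|Au\|_2.
\]

Taking $\theta = \tau/2$ finishes the proof of Theorem 3.2.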

3.3 Proof of Lemma 3.6 

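Note that $\alpha \ge \|Y_j\|_{\psi_2} \ge \|Y_j\|_2 = 1$. Using the elementary inequality $t^k \le k!\, s^k e^{t/s}$, which holds for all $t, s > 0$, we obtain

\[
|\mathbb{E}(Y_j^2 - 1)^k| \le \max(\mathbb{E}Y_j^{2k}, 1) \le \max(k!\, \alpha^{2k} \cdot \mathbb{E}e^{Y_j^2/\alpha^2}, 1) \le 2\, k!\, \alpha^{2k}
\]

for any $k \ge 2$. Since $\mathbb{E}Y_j^2 = 1$ for any $j$, for any $\tau \in \mathbb{R}$ with $|\tau|\alpha^2 < 1$

\[
\mathbb{E}\exp[\tau(Y_j^2 - 1)] \le 1 + \sum_{k=2}^{\infty} \frac{1}{k!} |\tau|^k \cdot |\mathbb{E}(Y_j^2 - 1)^k| \le 1 + \sum_{k=2}^{\infty} |\tau|^k \cdot 2\alpha^{2k} = 1 + \frac{2\tau^2\alpha^4}{1 - |\tau|\alpha^2} \le \exp\left( \frac{2\tau^2\alpha^4}{1 - |\tau|\alpha^2} \right).
\]

By Markov's inequality, for $\tau \in (0, \alpha^{-2})$

\[
\mathbb{P}\left( \frac{1}{n} \sum_{j=1}^n Y_j^2 - 1 > \theta \right) \le e^{-\tau\theta n} \cdot \prod_{j=1}^n \mathbb{E}\exp[\tau(Y_j^2 - 1)] \le \exp\left( -\tau\theta n + \frac{2\tau^2\alpha^4 n}{1 - \tau\alpha^2} \right).
\]

Set $\tau = \frac{\theta}{5\alpha^4}$, so $\tau\alpha^2 \le 1/5$. Then the previous inequality implies

\[
\mathbb{P}\left( \frac{1}{n} \sum_{j=1}^n Y_j^2 - 1 > \theta \right) \le \exp\left( -\frac{\theta^2 n}{10\alpha^4} \right).
\]

Similarly, considering $\tau < 0$, we obtain

\[
\mathbb{P}\left( 1 - \frac{1}{n} \sum_{j=1}^n Y_j^2 > \theta \right) \le \exp\left( -\frac{\theta^2 n}{10\alpha^4} \right).
\]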

□ 
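For readers who want to see where the constant 10 comes from, the exponent can be worked out explicitly. This is a sketch under the choice $\tau = \theta/(5\alpha^4)$ made in the proof, using only $\theta < 1$ and $\alpha \ge 1$:

\[
\tau\theta n = \frac{\theta^2 n}{5\alpha^4}, \qquad
\tau\alpha^2 = \frac{\theta}{5\alpha^2} \le \frac{1}{5}
\ \Longrightarrow\ \frac{1}{1 - \tau\alpha^2} \le \frac{5}{4},
\]

\[
\frac{2\tau^2\alpha^4 n}{1 - \tau\alpha^2} \le \frac{5}{4} \cdot \frac{2\theta^2 n}{25\alpha^4} = \frac{\theta^2 n}{10\alpha^4},
\qquad
-\tau\theta n + \frac{2\tau^2\alpha^4 n}{1 - \tau\alpha^2} \le -\frac{\theta^2 n}{5\alpha^4} + \frac{\theta^2 n}{10\alpha^4} = -\frac{\theta^2 n}{10\alpha^4}.
\]

Combining the upper and lower tails by a union bound gives the factor 2 in the statement of Lemma 3.6.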



4 RE condition for random matrices with bounded entries 

We next consider the case of design matrix X consisting of independent identically distributed rows with 
bounded entries. As in the previous section, we reformulate Theorem 1.8 in the form of an almost isometry 
condition. 

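Theorem 4.1. Let $0 < \delta < 1$ and $0 < s_0 < p$. Let $Y \in \mathbb{R}^p$ be a random vector such that $\|Y\|_\infty \le M$ a.s., and denote $\Sigma = \mathbb{E}YY^T$. Let $X$ be an $n \times p$ matrix whose rows $X_1, \ldots, X_n$ are independent copies of $Y$. Let $\Sigma$ satisfy the RE$(s_0, 3k_0, \Sigma^{1/2})$ condition as in Definition 1.1. Set

\[
d = d(3k_0, \Sigma^{1/2}) = s_0 + s_0 \max_j \|\Sigma^{1/2}e_j\|_2^2 \left( \frac{16 K^2(s_0, 3k_0, \Sigma^{1/2})(3k_0)^2(3k_0 + 1)}{\delta^2} \right).
\]

Assume that $d \le p$ and $\rho = \rho_{\min}(d, \Sigma^{1/2}) > 0$. If for some absolute constant $C$

\[
n \ge \frac{C M^2 d \cdot \log p}{\rho\delta^2} \cdot \log^3\left( \frac{C M^2 d \cdot \log p}{\rho\delta^2} \right),
\]

then with probability at least $1 - \exp(-\delta\rho n/(6M^2 d))$ all vectors $u \in \mathrm{Cone}(s_0, k_0)$ satisfy

\[
(1 - \delta)\|\Sigma^{1/2}u\|_2 \le \frac{\|Xu\|_2}{\sqrt{n}} \le (1 + \delta)\|\Sigma^{1/2}u\|_2.
\]

Similarly to Theorem 3.1, Theorem 4.1 can be derived from Theorem 2.1 and the corresponding bound for $d$-sparse vectors, which is proved below.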

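Theorem 4.2. Let $Y \in \mathbb{R}^p$ be a random vector such that $\|Y\|_\infty \le M$ a.s., and denote $\Sigma = \mathbb{E}YY^T$. Let $X$ be an $n \times p$ matrix whose rows $X_1, \ldots, X_n$ are independent copies of $Y$. Let $0 < m \le p$. If $\rho = \rho_{\min}(m, \Sigma^{1/2}) > 0$ and

\[
n \ge \frac{C M^2 m \cdot \log p}{\rho\delta^2} \cdot \log^3\left( \frac{C M^2 m \cdot \log p}{\rho\delta^2} \right), \tag{4.1}
\]

then with probability at least $1 - 2\exp\left( -\frac{\delta\rho n}{6M^2 m} \right)$ all $m$-sparse vectors $u$ satisfy

\[
1 - \delta \le \frac{1}{\sqrt{n}} \frac{\|Xu\|_2}{\|\Sigma^{1/2}u\|_2} \le 1 + \delta.
\]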



To prove Theorem 4.2 we consider the random variables $Z_u = \|Xu\|_2 / (\sqrt{n}\, \|\Sigma^{1/2}u\|_2) - 1$, and estimate the expectation of the supremum of $Z_u$ over the set of sparse vectors using Dudley's entropy integral. The proof of this part closely follows Rudelson and Vershynin (2008), so we will only sketch it. To derive the large deviation estimate from the bound on the expectation we use Talagrand's measure concentration theorem for empirical processes, which provides a sharper estimate than the method used in Rudelson and Vershynin (2008).

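Proof. For $J \subset \{1, \ldots, p\}$ let $E_J$ be the coordinate subspace spanned by the vectors $e_j$, $j \in J$. Set

\[
F = \bigcup_{|J| = m} \Sigma^{1/2}E_J \cap S^{p-1}.
\]

Denote $\Psi = \Sigma^{-1/2}Y$, so $\mathbb{E}\Psi\Psi^T = \mathrm{id}$, and let $\Psi_1, \ldots, \Psi_n$ be independent copies of $\Psi$. It is enough to show that with probability at least $1 - \exp\left( -\frac{\delta\rho n}{6M^2 m} \right)$, for any $y \in F$

\[
\left| \frac{1}{n} \sum_{j=1}^n \langle \Psi_j, y\rangle^2 - 1 \right| \le \delta.
\]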



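To this end we estimate

\[
\Delta := \mathbb{E}\sup_{y \in F} \left| \frac{1}{n} \sum_{j=1}^n \langle \Psi_j, y\rangle^2 - 1 \right|.
\]

The standard symmetrization argument implies that

\[
\mathbb{E}\sup_{y \in F} \left| \frac{1}{n} \sum_{j=1}^n \langle \Psi_j, y\rangle^2 - 1 \right| \le \frac{2}{n}\, \mathbb{E}\sup_{y \in F} \left| \sum_{j=1}^n \varepsilon_j \langle \Psi_j, y\rangle^2 \right|,
\]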

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent Bernoulli random variables taking values $\pm 1$ with probability $1/2$. The estimate of the last quantity is based on the following lemma, which is similar to Lemma 3.6 of Rudelson and Vershynin (2008).

Lemma 4.3. Let $F$ be as above, and let $\psi_1, \ldots, \psi_n \in \mathbb{R}^p$. Set



Q = max 

j=l,...,n 



Then 



Esup 

y&F 



^^ii'^^y) 



<d log(^— ^) • sup|^(V,-,y) 



1/2 
Assuming Lemma 4.3, we finish the proof of the theorem. First, note that by the definition of $\Psi_j$,

\[
\max_{j = 1, \ldots, n} \|\Sigma^{1/2}\Psi_j\|_\infty \le M \quad \text{a.s.}
\]



Hence, conditioning on $\Psi_1, \ldots, \Psi_n$ and applying Lemma 4.3, we obtain

1/2 



and by the Cauchy-Schwarz inequality,



1/2 / 1/2 

\2 



so 



2 I CmM"^ ■ log n ■ log p ^__(CmM'^\ ^ ■,\i/2 



If $n$ satisfies (4.1), then

\[
\Delta \le \delta \cdot (\Delta + 1)^{1/2}, \quad \text{and thus } \Delta \le 2\delta.
\]

For $y \in F$ define a random variable $f(y) = \langle \Psi, y\rangle^2 - 1$. Then $|f(y)| \le \langle Y, \Sigma^{-1/2}y\rangle^2 + 1 \le M^2\rho^{-1}m + 1 := a$ a.s., because $\Sigma^{-1/2}y$ is an $m$-sparse vector whose norm does not exceed $\rho^{-1/2}$. Set

\[
Z = \sup_{y \in F} \sum_{j=1}^n f_j(y),
\]

where $f_1(y), \ldots, f_n(y)$ are independent copies of $f(y)$. The argument above shows that $\mathbb{E}Z \le 2\delta n$. Then Talagrand's concentration inequality for empirical processes (Ledoux, 2001) reads

^{Z > t) < exp ( — ^ ) < exp ' 



6a ~ V 6M2 



for all t > 2EZ. Setting t = 46n, we have 

ASnp 



P(sup V ((^'j,y)2 -I) > 45n) < exp 
J/e^,=i 



Similarly, considering random variables $g(y) = 1 - \langle \Psi, y\rangle^2$, we show that



P(sup V (1 - (^'j,y)^) > 45n) < exp 

2/6^ , = 1 



4Snp 



which completes the proof of the theorem. □ 
It remains to prove Lemma 4.3. By Dudley's inequality



E sup 

y&F 



/•oo 

<C N{F, d, u) du. 

Jo 



Here d is the natural metric of the related Gaussian process defined as 



d{x,y) 



1/2 



< 



where 



< 2R • \\x — yWy , 

1/2 



1/2 



• max \{tpj,x-y)\ 

j=l,...,n 



R = sup \y^{i;j,yf \ , and = max \{ijj,z)\. 

yeF 1 ^ / .7=1,...," 



The inclusion ^/mB\ D |J| j|=m Ej n ^ implies 



Hence, for any y £ F 



T}I^Bl D conv( [J Ej n S^-^) D p^/^F. 

\J\=m 



< p-^/^V^ max S^/V,- = 



j=l,...,n 

Replacing the metric $d$ with the norm $\|\cdot\|_Y$, we obtain



Esup 

y&F 



< CR 



\og^^^ N{F,\\-\\y,u)du. 



(4.2) 



The upper limit of integration is greater than or equal to the diameter of $F$ in the norm $\|\cdot\|_Y$, so for $u > \rho^{-1/2}\sqrt{m}\,Q$ the integrand is 0. Arguing as in Lemma 3.7 of Rudelson and Vershynin (2008), we can show that



(4.3) 



where 



Cp (maxj=i^..._pmaxj=i_..._„ |(i;V2e.,'0^.) 



log n 



CmQ'^ ■ logn 
pv? 



Also, since $F$ consists of the union of $\binom{p}{m}$ Euclidean spheres, the inclusion (4.2) and the volumetric estimate yield



N{F, 



> \V\\Y ' 



u) < 



1 + 



< 



u 



771/ 



1 + 



2p-^l'^^Q 



u 



• (4.4) 
Estimating the covering number of $F$ as in (4.3) for $u \ge 1$, and as in (4.4) for $0 < u < 1$, we obtain



E sup 

y&F 



< CR • log 



+ log 1 + 



1/2 



du 



+CR 



-1/2 



/ CmQ2.log 



n 



■ \/log 2p du 



< CRi 



I mQ"^ ■ log n ■ logp 



log 



V P 



□ 



Remark 4.4. Note that unlike the case of a random matrix with subgaussian marginals, the estimate of Theorem 4.2 contains the minimal sparse singular value $\rho$. This is, however, necessary, as the following example shows.

Let $m = 2^{\ell}$, and assume that $p = k \cdot m$ for some $k \in \mathbb{N}$. For $j = 1, \ldots, k$ let $D_j$ be the $m \times m$ Walsh matrix. Let $A$ be a $p \times p$ block-diagonal matrix with blocks $D_1, \ldots, D_k$ on the diagonal, and let $Y \in \mathbb{R}^p$ be a random vector whose values are the rows of the matrix $A$ taken with probabilities $1/p$. Then $\|Y\|_\infty = 1$ and $\mathbb{E}YY^T = (m/p) \cdot \mathrm{id}$, so $\rho = m/p$. Hence, the right-hand side of (4.1) reduces to

\[
\frac{Cp \cdot \log p}{\delta^2} \cdot \log^3\left( \frac{Cp \cdot \log p}{\delta^2} \right).
\]

On the other hand, if the matrix $X$ satisfies the conditions of Theorem 4.2 with, say, $\delta = 1/2$, then all rows of the matrix $A$ should be present among the rows of the matrix $X$. An elementary calculation shows that in this case it is necessary to assume that $n \ge Cp\log p$, so the estimate (4.1) is exact up to a power of the logarithm.
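The covariance computation in this example is easy to reproduce. The sketch below assumes that the Walsh matrix can be taken to be the Sylvester-type Hadamard matrix with entries $\pm 1$ (row ordering does not affect the covariance); the function names `walsh` and `block_diag_walsh` are hypothetical.

```python
import numpy as np

def walsh(l):
    """Sylvester construction of a 2^l x 2^l matrix with +/-1 entries and orthogonal rows."""
    H = np.array([[1]])
    for _ in range(l):
        H = np.kron(np.array([[1, 1], [1, -1]]), H)
    return H

def block_diag_walsh(l, k):
    """p x p block-diagonal matrix with k copies of the 2^l x 2^l Walsh matrix, p = k * 2^l."""
    return np.kron(np.eye(k, dtype=int), walsh(l))
```

With $Y$ uniform over the rows of the block-diagonal matrix $A$, the covariance is $A^T A / p$, which the construction makes equal to $(m/p)$ times the identity, while every row has sup-norm 1.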

Unlike the matrix $\Sigma$, the matrix $A$ is not symmetric. However, the example above can be easily modified by considering a $2p \times 2p$ matrix



This shows that the estimate (4.1) is tight under the symmetry assumption as well. 



References 

Adamczak, R., Latala, R., Litvak, A. E., Pajor, A. and Tomczak-Jaegermann, N. (2011).
Geometry of log-concave ensembles of random matrices and approximate reconstruction. arXiv:1103.0401v1.

Adamczak, R., Litvak, A. E., Pajor, A. and Tomczak-Jaegermann, N. (2009). Restricted
isometry property of matrices with independent columns and neighborly polytopes by random sampling.
arXiv:0904.4723v1.






Baraniuk, R. G., Davenport, M., DeVore, R. A. and Wakin, M. B. (2008). A simple proof of the
restricted isometry property for random matrices. Constructive Approximation 28 253-263.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig
selector. The Annals of Statistics 37 1705-1732.

Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. The
Electronic Journal of Statistics 1 169-194.

Cai, T., Wang, L. and Xu, G. (2010). Stable recovery of sparse signals and an oracle inequality. IEEE 
Transactions on Information Theory 56 3516-3522. 

Candes, E. and Plan, Y. (2009). Near-ideal model selection by $\ell_1$ minimization. Annals of Statistics 37
2145-2177.

Candes, E., Romberg, J. and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate 
measurements. Communications in Pure and Applied Mathematics 59 1207-1223. 

Candes, E. and Tao, T. (2005). Decoding by Linear Programming. IEEE Trans. Info. Theory 51 4203- 
4215. 

Candes, E. and Tao, T. (2006). Near optimal signal recovery from random projections: Universal encod- 
ing strategies? IEEE Trans. Info. Theory 52 5406-5425. 

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. 
Annals of Statistics 35 2313-2351. 

Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. 
SIAM Journal on Scientific and Statistical Computing 20 33-61. 

Donoho, D. (2004). For most large underdetermined systems of equations, the minimal $\ell_1$-norm near-
solution approximates the sparsest near-solution. Tech. rep., Stanford University.

Donoho, D. (2006a). Compressed sensing. IEEE Trans. Info. Theory 52 1289-1306.

Donoho, D. (2006b). For most large underdetermined systems of equations, the minimal $\ell_1$-norm solution
is also the sparsest solution. Communications in Pure and Applied Mathematics 59 797-829.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika
81 425-455.

Duncan, G. and Pearson, R. (1991). Enhancing access to microdata while protecting confidentiality: 
Prospects for the future. Statistical Science 6 219-232. 

Gordon, Y. (1985). Some inequalities for gaussian processes and applications. Israel Journal of Mathe- 
matics 50 265-289. 

Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and
the virtue of over-parametrization. Bernoulli 10 971-988.






Koltchinskii, V. (2009a). Dantzig selector and sparsity oracle inequalities. Bernoulli 15 799-828.

Koltchinskii, V. (2009b). Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincare
Probab. Statist. 45 7-57.

Ledoux, M. (2001). The concentration of measure phenomenon. Mathematical Surveys and Monographs, 
89. American Mathematical Society. 

Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with the 
Lasso. Annals of Statistics 34 1436-1462. 

Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional 
data. Annals of Statistics 37 246-270. 

Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian
operators in asymptotic geometric analysis. Geometric and Functional Analysis 17 1248-1282.

Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2008). Uniform uncertainty principle for
Bernoulli and subgaussian ensembles. Constructive Approximation 28 277-289.

Milman, V. D. and Schechtman, G. (1986). Asymptotic Theory of Finite Dimensional Normed Spaces.
Lecture Notes in Mathematics 1200. Springer.

Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. Seminar on Functional Analysis,
Ecole Polytech., Palaiseau.

Raskutti, G., Wainwright, M. and Yu, B. (2009). Minimax rates of estimation for high-dimensional
linear regression over $\ell_q$-balls. In Allerton Conference on Control, Communication and Computer. Longer
version in arXiv:0910.2042v1.

Raskutti, G., Wainwright, M. and Yu, B. (2010). Restricted nullspace and eigenvalue properties for 
correlated gaussian designs. Journal of Machine Learning Research 2241-2259. 

Rauhut, H., Schnass, K. and Vandergheynst, P. (2008). Compressed sensing and redundant dictio- 
naries. IEEE Transactions on Information Theory 54 2210-2219. 

Rudelson, M. and Vershynin, R. (2005). Geometric approach to error correcting codes and recon-
struction of signals. International Mathematical Research Notices 4019-4041.

Rudelson, M. and Vershynin, R. (2006). Sparse reconstruction by convex relaxation: Fourier and 
gaussian measurements. In 40th Annual Conference on Information Sciences and Systems ( CISS 2006). 

Rudelson, M. and Vershynin, R. (2008). On sparse reconstruction from fourier and gaussian measure- 
ments. Communications on Pure and Applied Mathematics 1025-1045. 

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58
267-288.






van de Geer, S. and Buhlmann, P. (2009). On the conditions used to prove oracle results for the lasso.
Electronic Journal of Statistics 3 1360-1392.

van de Geer, S. A. (2008). High-dimensional generalized linear models and the Lasso. The Annals of
Statistics 36 614-645.

Vershynin, R. (2011a). Approximating the moments of marginals of high dimensional distributions.
Annals of Probability, to appear.

Vershynin, R. (2011b). How close is the sample covariance matrix to the actual covariance matrix? 
Journal of Theoretical Probability, to appear . 

Wainwright, M. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-
constrained quadratic programming. IEEE Trans. Inform. Theory 55 2183-2202.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional
linear regression. Annals of Statistics 36 1567-1594.

Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning 
Research 7 2541-2567. 

Zhou, S. (2009a). Restricted eigenvalue conditions on subgaussian random matrices. ArXiv:0904.4723v2. 

Zhou, S. (2009b). Thresholding procedures for high dimensional variable selection and statistical estima- 
tion. In Advances in Neural Information Processing Systems 22. MIT Press. 

Zhou, S. (2010). Thresholded lasso for high dimensional variable selection and statistical estimation. 
ArXiv:1002.1583v2. 

Zhou, S., Lafferty, J. and Wasserman, L. (2009a). Compressed and privacy sensitive sparse regres- 
sion. IEEE Transactions on Information Theory 55 846-866. 

Zhou, S., Rutimann, P., Xu, M. and Buhlmann, P. (2011). High-dimensional covariance estimation 
based on Gaussian graphical models. Journal of Machine Learning Research, to appear . 

Zhou, S., van de Geer, S. and Buhlmann, P. (2009b). Adaptive Lasso for high dimensional regression 
and gaussian graphical modeling. ArXiv:0903.2515. 






